Abstract

This study of the stability of teacher behavior over time is
formulated through two major questions: (1) Is the behavior of an
individual teacher consistent over time? and (2) Are individual
differences among teachers consistent over time? Regrettably, the first
question has rarely been considered in previous investigations of the
stability of teacher behavior, and empirical research on the second
question has been marked by considerable confusion. In this paper we
develop statistical procedures for answering each of these questions.
Approaches and methods of previous studies of temporal stability are
reevaluated. In addition, methods for assessing the stability of teacher
behavior across contexts are described. Observational data on classroom
teachers are used throughout to illustrate our new approaches and
methods for the study of stability of behavior.

ASSESSING THE STABILITY OF TEACHER BEHAVIOR

2 

David Rogosa, Robert E. Floden, and John B. Willett

Questions About Stability 

Answers to the question, "Is teacher behavior stable?" have been sought by
researchers for the past half-century. Despite the many extensive studies of
teacher behavior, affirmative answers to this question have been rare. For
example, Borich (1977) states, "The results of these studies suggest that
teacher behavior may be unstable across long periods of time and content" (p.
300).^3 Moreover, researchers lack confidence in the results of these research
efforts; the conclusion of Shavelson and Dempsey-Atwood (1976) that "(a) most
studies are methodologically inadequate at this point in time to resolve the
issue, (b) findings on the . . . stability of measures of teacher behavior are
equivocal with only a few exceptions" (p. 609) is typical of the reviews of this
literature.

^3 The recent review by Darling-Hammond, Wise, and Pease (1983) summarizes
the research on stability of teaching behavior as follows: "The bottom-line
question is, Does a given teacher exhibit the same kinds of behavior at
different points in time and within different teaching contexts? In general,
the answer is 'no,' especially with regard to measures of specific, discrete
teaching behaviors" (p. 299).

A major problem in past investigations of stability is that the question,
"Is teacher behavior stable?" is not well formulated. Before satisfactory
conclusions about stability of teacher behavior can be drawn, the research
question must be refined.

In this paper we formulate and pursue two research questions about 
stability of teacher behavior over time: 

Question A. Is the behavior of an individual teacher consistent
over time?

Question B. Are individual differences among teachers consistent
over time?

Although there have been a few studies of the "variety" and "flexibility" of 
teacher behavior (see the studies reviewed in Rosenshine, 1971, Chap. 4), 
Question A has received almost no attention in research on teaching. This 
unfortunate omission reflects the neglect in research on teaching of the 
detailed description of individual teachers. Virtually every empirical study of 
stability of teacher behavior has been directed toward answering Question B.
These studies, however, have not met with great success in demonstrating 
consistency of individual differences among teachers. 

Purposes for Studying Stability 
Before turning to the statistical models and methods that we develop for 
addressing Questions A and B, it is useful to consider how these questions about
stability fit in with current and prior research on teaching. Thus we examine
the purposes for assessing stability of teacher behavior expressed in the
research-on-teaching literature. Furthermore, we advocate additional uses for 
assessments of stability of teacher behavior. 

Process-Product Research 

Just as one question (Question B) has received virtually all the attention
in the research literature, one concern has dominated the study of stability of
teacher behavior: low stability as a potential obstacle to the establishment of
process-product relationships. To demonstrate such relationships, affirmative
answers to both Question A and Question B are crucial. In the
research-on-teaching literature, a lack of stability is frequently cited as an
explanation for difficulties in demonstrating strong process-product relations.
Doyle (1977) notes the importance of stability in process-product research: "If
there is wide variability in either the behaviour of teachers or the instruments
used to measure that behavior, then estimates of process-product relationships
are precarious at best" (p. 169). Furthermore, it has been suggested that "measures
of teacher behavior may be too unstable to yield consistent relationships with 
student outcomes" (Shavelson & Dempsey-Atwood, 1976, p. 544). 

In detecting process-product relationships, the ability to rank teachers
reliably on process is essential. Hence, consistent individual differences in
teacher behavior facilitate detection of a link between individual differences
in teacher behavior and individual differences in teacher effectiveness.
Consequently, the effect of negative answers to Question B on the magnitudes of 
potential correlations between process and product measures has been a major 
concern in research on teacher effectiveness (see Brophy, 1979). 

In addition, the interpretability of process-product correlations depends
on an affirmative answer to Question A. In process-product correlation
strategies, process (teacher behavior) is represented by a single number (often
the average over multiple occasions of observation). Medley (1982) explains
that the consequence of relying on an average measure of behavior is that "any
intentional variations that a teacher introduces to adapt his or her behavior to
different purposes are treated as errors of measurement" (p. 1398).
Furthermore, an individual teacher's behavior must be consistent over time for a
single-number summary of process (i.e., the average over occasions) to be a
reasonably complete description of that teacher's activity in the classroom.
McGaw, Wardrop, and Bunda (1972) caution that "efforts to develop indices to
characterize particular teachers appear to be misplaced unless there is some
allowance made for lawful adaptations of behavior to different situations" (p.
16). Other investigators have expressed similar concerns about process-product
research; Berliner (1976) requires that "the behavior should be representative
of the teacher's usual and customary way of behaving" (p. 7, emphasis in
original) (see also Doyle, 1977, p. 169; Medley & Mitzel, 1963; Medley, 1979,
p. 14). Because almost no empirical research on the behavior of individual
teachers exists, it is impossible to judge whether the presumption of
consistency is reasonable.

We do not mean to imply that considerations of stability should dictate
which teacher behaviors are studied. In particular, it is important to note
that consistency over time is not necessarily a quality to be prized in
teachers. As Berliner (1976) writes:

Usually people think of good teachers as flexible. Such teachers are
expected to change methods, techniques and styles to suit particular
students, curriculum areas, time of day or year, etc. That is, the
standard of excellence in teaching commonly held implies a teacher whose
behavior is inherently unstable. (p. 9)

Flanders (1969, 1970) also prizes teacher flexibility and uses measures of
teacher flexibility as a process variable. 

Moreover, in understanding effective teaching, variables that are stable 
are not necessarily important: "There is little reason to presume on a priori 
grounds that behaviors which are either stable or generalizable across settings
are necessarily those that are the most powerful correlates of achievement in a 
given classroom situation" (Doyle, 1977, p. 169). 

Describing Teaching 

Interest in stability of teacher behavior as a description of teacher 
activities has been rare. Yet answers to Questions A and B are an important
part of understanding teaching. As Berliner (1976) asserts, "Until more is
known about which teacher behaviors fluctuate, and how and why they fluctuate
over time, settings, curricula, and populations, studies relating teacher 
behavior to student outcomes must remain primitive" (p. 9). 

Question A addresses the consistency over time of individual teachers' 
behavior. The assessment of stability for individual teachers can be used to 
address many research questions. Characteristics of a group of teachers can be 
investigated using assessments of the consistency of each teacher's behavior. 
For a group of teachers, it would be profitable to investigate questions about 
(a) the group as a whole and (b) individual differences in the consistency over 
time among group members. Examples of questions about the group as a whole are, 
"Are most teachers consistent over time in their classroom behavior?" "For 
which behaviors, if any, are most teachers consistent over time and for which
behaviors are most teachers not consistent?" The major question about
individual differences in consistency is, "What kinds of teachers are (or are
not) consistent in their behavior?" The detection of systematic individual
differences in consistency would be a first step toward understanding how and
why teacher behaviors fluctuate over time. 

With Question B, the focus shifts to descriptions of the consistency of
individual differences in teacher behavior over time. (We emphasize that
consistency of individual differences is distinct from consistency of an
individual teacher and from individual differences in consistency.) Some useful
research questions would be "Are individual differences in behavior among
teachers consistent over time?" "For what groups of teachers and what 
behaviors?" "In what situations?" and "Over what period of time?" 

Data on a Target Behavior
In this paper we develop methods for assessing stability of behavior using
the kinds of data usually available from classroom observation instruments.
Hence, an important first step in assessing stability is to understand the
structure of the data that are collected. (Lack of attention to the structure
of the data collected has weakened the value of many previous studies of 
stability.) Typically, data used to describe a teacher behavior have been one 
of three types: (a) behavior-count data, (b) Bernoulli-trial data, and (c) 
quantitative measures. Here, we develop appropriate statistical models and 
procedures for each type. 

^4 In one of the very few empirical studies of consistency of behavior for
individual teachers, Flanders (1969) investigated the influence on student
achievement of "flexibility" of individual teachers over time and over contexts.
Flanders found indications of an association between flexibility and
achievement, concluding that "investigations in this area are likely to be
rewarding" (p. 109). Thus a possible implication of Flanders' study is that
effective teachers may be the ones who are not consistent in their behavior.




Behavior-Count Data 

Behavior-count data are a familiar form of classroom observation data. For 
each occasion of observation, the behavior counts consist of the number of times
a target behavior occurs while observation is in progress. Ordinarily, no
information on the timing or the duration of the behaviors is recorded, only the
frequency of occurrence of the target behavior.^5 Table 1 shows behavior-count
data for two different teachers from the Texas Junior High School Study 
(Evertson, Anderson, Edgar, Minter, & Brophy, 1977; see Appendix B for a more 
complete description of these data). 

In Table 1 we retain the original teacher identification codes from the 
Texas Junior High School Study. Shown in the first row for each teacher are 
counts of behavioral criticism by the junior high school teacher during English 
instruction. (That is, the target behavior occurred whenever the junior high 
school English teacher gave a negative evaluation of student behavior, such as, 
"expressing anger or personal criticism," Evertson & Veldman, 1981, p. 157.) In 
the second row for each teacher are the number of hours of classroom observation 
in each month. 

To set notation for behavior-count data, let $X_i$ denote the number of
recorded occurrences of the target behavior during the observation of an
individual teacher on Occasion $i$. (The subscript $i$ indexes the observation
occasion; $i = 1, \ldots, T$.) Also let $b_i$ indicate the length of the
observation period on Occasion $i$ (in Table 1, the number of hours of
instruction observed each month). A natural statistical model for the
behavior-count data states that, for Occasion $i$, the $X_i$ are sampled from a
Poisson distribution with rate parameter $\lambda_i$, so that the expected count
on Occasion $i$ is $b_i \lambda_i$ (see Appendix A). (Statistical procedures
based on the Poisson distribution are applicable especially to the study of rare
events, such as certain classroom-management behaviors.) The rates of occurrence
of the target behavior ($\lambda_1, \ldots, \lambda_T$) are the key behavioral
parameters in the analysis of consistency over time for behavior-count data.

^5 As is discussed in the section on design and in Appendix A, the usual
procedure of recording only the frequency of occurrence of the target behavior
discards valuable information.

Table 1

Behavior-Count Data for Two Teachers

                                      Month
Data                        1      2      3      4      5      6

Teacher #21011
  Behavioral criticisms     7      3      3     19     16     26
  Hours of observation      3      1      1      4      5      6
  Empirical rate          2.3    3.0    3.0    4.7    3.2   4.33

Teacher #21052
  Behavioral criticisms     7     10      5      8      4     40
  Hours of observation      1      3      2      4      6      5
  Empirical rate          7.0    3.3    2.5    2.0   0.67    8.0

Note. The six months are November through April. Data were taken from
the Texas Junior High School Study. See Appendix B for description of these
data.

Bernoulli-trial Data 

A novel and useful way to represent certain teacher behaviors is through
the use of Bernoulli-trial data. Rather than recording only the frequency of a
target behavior within a time interval, the investigator records the number of
times the behavior occurs (the successes) among the opportunities for its
occurrence (the trials).

Table 2 displays data adapted from Moon's (1969, 1971) study of elementary
school science teachers (for a description of these data, see Appendix B). The
target behavior is teacher questioning, in particular the asking of lower-order
questions. For example, on Occasion 5, 27 questions were asked by Teacher 8, of
which 2 were lower-order questions, while on Occasion 6, 40 questions were asked,
of which 36 were lower-order questions.

In the Bernoulli-trial formulation, each question asked by the teacher
constitutes a trial; the occurrence of a lower-order question is considered a
success for that trial. The outcome of each Bernoulli trial (teacher question)
is dichotomous; the outcome is one if the teacher question is a lower-order
question and zero if it is not a lower-order question. The second row of the
data for each of the two teachers in Table 2 contains the total number of trials
(teacher questions) occurring on Occasion $i$, which is denoted by $n_i$. In the
first row are the total number of successes (lower-order questions) on Occasion
$i$, denoted by $X_i$. (Each $X_i$ is the sum of the $n_i$ dichotomous outcomes.)

Table 2

Bernoulli-Trial Data for Two Teachers

                                 Occasion
Data                        3      4      5      6

Teacher No. 13
  Lower-order questions     8     11     12     13
  All teacher questions    24     49     35     75
  Empirical proportion    .33    .22    .14    .26

Teacher No. 8
  Lower-order questions    23      6      2     36
  All teacher questions    66     29     27     40
  Empirical proportion    .35    .21    .07    .90

Note. Data were taken from Moon's elementary-science study. See
Appendix B for description of these data.

For some teacher behaviors, use of the Bernoulli-trial representation is a
useful alternative to the behavior-count representation. For example, on some
days a teacher may ask more questions than on other days. Consequently, even if
the teacher uses the same mix of higher-order and lower-order questions on each 
day, the number of lower-order questions would differ greatly from day to day, 
whereas the proportion of lower-order questions would remain relatively 
constant. The Bernoulli-trial representation may better reflect what teachers 
intend to do. 
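
The contrast between the two representations can be illustrated with a small sketch. The daily question totals below are hypothetical (they are not taken from Moon's data), and the constant 25% share of lower-order questions is an assumption chosen only for illustration.

```python
# Hypothetical daily totals of teacher questions (illustrative, not real data).
questions_per_day = [20, 60, 36, 80]

# Assumed constant share of lower-order questions on every day.
lower_order_share = 0.25

# Behavior-count view: the number of lower-order questions on each day.
lower_order_counts = [round(n * lower_order_share) for n in questions_per_day]

# Bernoulli-trial view: successes divided by trials on each day.
proportions = [x / n for x, n in zip(lower_order_counts, questions_per_day)]

print("counts:     ", lower_order_counts)
print("proportions:", proportions)
```

Under these assumed numbers the counts range widely across days while the proportions are identical, which is the situation in which the Bernoulli-trial representation gives the more faithful picture of consistency.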

Furthermore, our formulation of teacher questioning behavior through the 
use of Bernoulli-trial data is consistent with the considerable empirical 
research on the degree to which teachers use higher-order versus lower-order 
questions. That is, the mix of questions is what some researchers consider to 
be educationally important, as opposed to only the frequency of particular types 
of questions. In recent reviews of research on teacher questioning (Winne, 
1979; Redfield & Rousseau, 1981) interest centers on the effects of the contrast 
between teaching "dominated by fact questions" and teaching with "a greater 
proportion of higher cognitive questions" (Winne, 1979, p. 14). Earlier 
research on types of teacher questions is described in Gall (1970), Shavelson, 
Berliner, Ravitch & Loeding (1974), and Ryan (1973, 1974). 

The Bernoulli-trial representation is also applicable to the analysis of 
contingent behaviors. For example, after a student has answered a question
correctly, the teacher may or may not praise that response. In representing
this behavior sequence, teacher behavior would be dichotomized: the teacher
either praises a student's correct response or does not praise that response.
The correct student answer defines a trial; the occurrence of teacher praise in 
response to the student's answer is a success. 
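
As a rough sketch of how such contingent data might be tallied, the code below counts trials and successes from a coded event sequence. The event codes, the sequence itself, and the helper name `tally_contingent` are all hypothetical, and the rule that praise must immediately follow the correct answer is an assumption about the coding scheme.

```python
# Hypothetical coded event sequence for one observation occasion.
# "correct" = a student answers correctly; "praise" = the teacher praises.
events = ["question", "correct", "praise", "question", "incorrect",
          "question", "correct", "question", "correct", "praise"]


def tally_contingent(events, trigger="correct", response="praise"):
    """Return (successes, trials) for a contingent behavior.

    A trial is each occurrence of `trigger`; a success is a trial that is
    immediately followed by `response` (an assumed coding rule).
    """
    trials = 0
    successes = 0
    for position, code in enumerate(events):
        if code != trigger:
            continue
        trials += 1
        has_next = position + 1 < len(events)
        if has_next and events[position + 1] == response:
            successes += 1
    return successes, trials


x, n = tally_contingent(events)
print(f"{x} praised out of {n} correct answers")
```

The pair (x, n) for each occasion then plays the same role as the lower-order and total question counts in Table 2.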

Another example of a behavior sequence for which the Bernoulli-trial
representation is appropriate arises from the study of teachers' verbal behavior
during reading lessons reported by Allington (1980). The behavior sequence of
interest was the teacher's response (interruption or not) following an
oral-reading error. In our formulation a student oral-reading error would define a
trial for which the outcome is whether or not the teacher interrupts the
student. Allington finds marked differences in teacher interruption behavior
with students of high and low reading ability and advocates, among a number of
directions for further research, that "research should identify whether teachers 
are consistent across time [on interruption behavior]" (p. 376). 

Statistical models for Bernoulli-trial data are based on the binomial
distribution (see Appendix A). For Occasion $i$, the probability of a success
(e.g., a lower-order question) in a single trial is $\pi_i$. The
$\pi_1, \ldots, \pi_T$ are the key behavioral parameters in the analysis of
consistency over time for Bernoulli-trial data.

An extension of the Bernoulli-trial formulation, using the multinomial 
distribution, would be appropriate for a trial having more than two possible
outcomes. For example, Moon (1969) actually classified teacher questions into
five categories: recall facts, see relationships, make observations,
hypothesize, test hypotheses. In Table 2 the five categories have been
collapsed into two categories: recall facts (lower-order questions) versus
other kinds of questions. We emphasize the binomial model because of its
simplicity and wide applicability. Use of the multinomial model may be 
advantageous when the multiple outcomes of a trial can be clearly categorized. 

Quantitative Measures 

The third type of data on teacher behavior are quantitative measures. 
Examples of quantitative measures include high-inference measures such as 
observer ratings of teacher behavior (perhaps averaged over raters), derived 
quantities like the indirectness-directness ratio of Flanders (1970), or a
quantity such as the number of minutes of transition time between classroom
activities. Tables 3 and 4 display examples of quantitative measures. In Table
3 are values of Flanders's indirectness-directness (I/D) ratio over five
occasions for two of the teachers observed by Moon (1969). In Table 4 are 
monthly averages of ratings of positive affect (rated on a zero-to-four scale) 
in two English classes from the Texas Junior High School Study (Evertson & 
Veldman, 1981). (For a description of these data, see Appendix B.) These two 
classes were taught by the same teacher. 

We model these quantitative measures by assuming that, on Occasion $i$, the
measure has a Gaussian (normal) distribution with mean $\mu_i$ and variance
$\sigma^2$. (Sometimes a standard data transformation may be useful to make the
assumption of a Gaussian distribution more reasonable.) The means at each
occasion, $\mu_1, \ldots, \mu_T$, are the key behavioral parameters in assessing
consistency over time for quantitative measures.

Table 3

Indirectness-Directness (I/D) Ratio for Two Teachers

                           Occasion
Teacher        2       3       4       5       6

6            0.87    0.88    0.88    1.29    0.72
1            1.11    6.30    1.36    0.65    4.39

Note. Data were taken from Moon's elementary-school science study. See
Appendix B for a description of these data.

Table 4

Ratings of Positive Affect for Two Classes

                              Month
Class          1       2       3       4       5       6

24033        3.00    3.33    2.57    2.00    1.33    1.00
24035        3.50    3.33    3.00    3.00    2.33    2.50

Note. Data were taken from the Texas Junior High School Study. See
Appendix B for a description of these data.
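
As a purely descriptive sketch, and not one of the formal procedures developed in this paper, the occasion-to-occasion spread of the I/D ratios in Table 3 can be summarized by each teacher's sample mean and sample variance over occasions. Note that such a sample variance mixes any changes in the $\mu_i$ with the within-occasion variance $\sigma^2$, so it is only a first look at the data.

```python
from statistics import mean, variance

# I/D ratios over Occasions 2-6, copied from Table 3.
id_ratio = {
    "Teacher 6": [0.87, 0.88, 0.88, 1.29, 0.72],
    "Teacher 1": [1.11, 6.30, 1.36, 0.65, 4.39],
}

# Sample mean and (n - 1 denominator) sample variance across occasions.
summary = {teacher: (mean(values), variance(values))
           for teacher, values in id_ratio.items()}

for teacher, (avg, spread) in summary.items():
    print(f"{teacher}: mean = {avg:.2f}, variance over occasions = {spread:.2f}")
```

The much larger spread for Teacher 1 is the kind of contrast between individual teachers that Question A is meant to capture.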






Representing Stability for the Behavior
of an Individual Teacher

The temporal stability (consistency over time) of the behavior of an
individual teacher, which is the subject of Question A, requires that the
behavior of the teacher remain unchanged over time. In the homogeneity
hypotheses below, the consistency over time of the behavior of an individual
teacher is formally represented in terms of the behavioral parameter for each
type of data.

Behavior-count data. The rate parameter, $\lambda_i$, of the Poisson distribution
is constant over occasions: $\lambda_i = \lambda$ for all $i$.

Bernoulli-trial data. The parameter, $\pi_i$, of the binomial distribution is
constant over occasions: $\pi_i = \pi$ for all $i$.

Quantitative measures. The parameter $\mu_i$ of the Gaussian distribution is
constant over occasions: $\mu_i = \mu$ for all $i$.

Although these representations of stability are straightforward, possible
departures from stability are numerous and complex. Figure 1 displays different
ways in which a homogeneity hypothesis can be violated. For each of the four
plots, the behavioral parameter ($\lambda_i$, $\pi_i$, or $\mu_i$) is plotted on
the vertical axis, and time (ordered occasions of observation) is plotted on the
horizontal axis. The homogeneity hypothesis is satisfied in the upper-left
quadrant of Figure 1; the label, "Absolute Invariance," is after Wohlwill (1973,
Figure 12.2). In this quadrant the behavioral parameter is unchanging over
occasions.

The other three quadrants of Figure 1 depict specific configurations in
which the homogeneity hypothesis is not satisfied. These configurations are
presented to stress that rejection of a homogeneity hypothesis can be due to
different forms of heterogeneity. In the upper right quadrant the behavioral
parameter follows a systematic time trend. (If this particular configuration
were "detrended," the adjusted behavioral parameter would satisfy the
homogeneity hypothesis.) In the lower left quadrant, there is no time trend, but
the scatter of the behavioral parameter violates the homogeneity hypothesis.^6 In
the lower right quadrant, both the systematic time trend and the scatter
contribute to the heterogeneity over time.

Figure 1. Illustrations of stability (consistency over time) and departures
from stability; in each quadrant, the vertical axis is the
behavioral parameter and the horizontal axis is time.
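
The four configurations just described can be mimicked numerically. The sketch below is only an illustration: the function `parameter_path`, the levels, the slope, the scatter standard deviation, and every quadrant label other than "absolute invariance" are assumptions introduced here, not quantities taken from Figure 1.

```python
import random


def parameter_path(T, level, slope=0.0, scatter_sd=0.0, seed=0):
    """Behavioral parameter on T ordered occasions.

    Each value is level + slope * (occasion index) plus, when scatter_sd > 0,
    a Gaussian disturbance drawn from a seeded generator.
    """
    rng = random.Random(seed)
    path = []
    for occasion in range(T):
        disturbance = rng.gauss(0.0, scatter_sd) if scatter_sd > 0 else 0.0
        path.append(level + slope * occasion + disturbance)
    return path


T = 6  # number of occasions (assumed)
configurations = {
    "absolute invariance": parameter_path(T, level=4.0),
    "trend only": parameter_path(T, level=2.0, slope=0.8),
    "scatter only": parameter_path(T, level=4.0, scatter_sd=1.0),
    "trend and scatter": parameter_path(T, level=2.0, slope=0.8, scatter_sd=1.0),
}

for name, path in configurations.items():
    print(f"{name:20s}", [round(value, 2) for value in path])
```

Only the first path satisfies a homogeneity hypothesis; removing the linear trend from the second path would leave a constant sequence, whereas the third and fourth paths remain heterogeneous even after detrending.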

Statistical Procedures for the Behavior of an Individual Teacher
The statistical procedures for addressing Question A (the consistency over
time of the behavior of an individual teacher) have two main functions: (a) to
assess the viability of the relevant homogeneity hypothesis, and (b) to estimate
the amount of heterogeneity. The exact form of the appropriate statistical 
procedure differs for the three types of data; each type of data is considered 
in turn. 

Statistical Procedures for Behavior-Count Data 

Testing homogeneity. One obvious way to assess the consistency over time
of the behavior of an individual teacher is to perform, using that teacher's
data, a test of the homogeneity hypothesis ($\lambda_i = \lambda$ for all $i$)
against the general alternative of non-homogeneity, that not all the
$\lambda_i$ are equal. The traditional test statistic for this null hypothesis
is presented in equation A1 of Appendix A. This test statistic assesses whether
the observed counts (the $X_i$) are more spread out than would be expected under
the homogeneity hypothesis,
provided that the assumptions of the Poisson model hold. 



^6 In statistical terms this quadrant represents a stationary, doubly-stochastic
process, such as a doubly-stochastic Poisson process where the $\lambda_i$ are
themselves realizations of a stationary stochastic process (see, for example, Cox
& Lewis, 1966, Section 4.7).

Estimating heterogeneity. A useful supplement to testing the homogeneity
hypothesis is to estimate the amount of heterogeneity in the teacher's behavior.
This is represented by the variance of the $\lambda_i$, $\sigma^2_\lambda$.
Because the homogeneity hypothesis posits that $\sigma^2_\lambda$ is zero, an
estimate of $\sigma^2_\lambda$ is a natural measure of heterogeneity. This
estimate of $\sigma^2_\lambda$ for a teacher can be considered a description of
that teacher's behavior in the same way that the average rate of behavior
describes that teacher. Cox (1955, Section 5.3) developed a variance component
estimation procedure for $\sigma^2_\lambda$ that does not require choosing a
specific distribution for the $\lambda_i$. Alternatively, when all the
observation periods are the same length and the $\lambda_i$ can be assumed to
have a gamma distribution, estimation procedures based on the negative binomial
distribution are appropriate (see Appendix A).
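
A distribution-free estimate of this kind can be sketched with a method-of-moments argument. If $X_i$ given $\lambda_i$ is Poisson with mean $b_i \lambda_i$, then the variance of the empirical rate $X_i/b_i$ is $\sigma^2_\lambda + \mu_\lambda / b_i$, so subtracting the average Poisson sampling part from the sample variance of the rates leaves an estimate of $\sigma^2_\lambda$. Whether this matches Cox's (1955) procedure in every detail is an assumption; the function name `rate_moments` is hypothetical. The sketch does reproduce the two variance estimates reported for these teachers in Table 5 (0.00 and 6.95).

```python
from statistics import mean, variance


def rate_moments(counts, hours):
    """Moment estimates of the mean and variance of the occasion rates.

    The sample variance of the empirical rates X_i / b_i is reduced by the
    part attributable to Poisson sampling, mean_rate * average(1 / b_i).
    A negative variance component is reported as zero.
    """
    rates = [x / b for x, b in zip(counts, hours)]
    mean_rate = mean(rates)
    sampling_part = mean_rate * mean([1.0 / b for b in hours])
    variance_component = variance(rates) - sampling_part
    return mean_rate, max(0.0, variance_component)


# Counts of behavioral criticism and hours of observation from Table 1.
table_1 = {
    "21011": ([7, 3, 3, 19, 16, 26], [3, 1, 1, 4, 5, 6]),
    "21052": ([7, 10, 5, 8, 4, 40], [1, 3, 2, 4, 6, 5]),
}

for teacher, (counts, hours) in table_1.items():
    _, var_hat = rate_moments(counts, hours)
    print(f"Teacher {teacher}: estimated variance of rates = {var_hat:.2f}")
```

Truncating a negative component at zero is a common convention for variance-component estimates and is the choice assumed here.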

Trends. One important violation of the homogeneity hypothesis is a time- 
dependence for the rate of behaviors (as shown in the upper- and lower-right 
quadrants of Figure 1). For example, the frequency of call-outs by students may 
decline systematically over the course of a school year. Methods for modeling 
and analyzing the time dependence of the $\lambda_i$ are presented in Cox and Lewis
(1966, Chapter 3). 

Examples. To illustrate statistical procedures for behavior-count data,
analyses of the data in Table 1 are presented in Table 5. For each teacher the
empirical rates of behavioral criticism, $\hat{\lambda}_i = X_i/b_i$, are shown
in Table 1. Displayed in Table 5 are the test statistic values for the
homogeneity hypothesis (from equation A1 of Appendix A) and also estimates of
the mean and variance of the distribution of the $\lambda_i$.



Table 5

Heterogeneity-in-Counts Analysis
for the Data from Table 1

                                 Estimated moments of the $\lambda$ distribution
            Homogeneity test
Teacher     statistic^a            Mean        Variance

21011        3.96 (5)              4.77          0.00
21052       49.0  (5)              3.92          6.95

^a Numbers in parentheses are the degrees of freedom for the test statistic.

Although the two teachers have comparable average rates of occurrence for
the target behavior (about four to five instances of behavioral criticism per
hour), inspection of the $\hat{\lambda}_i$ in Table 1 shows that Teacher 21052
appears to be far less consistent over time than Teacher 21011. More formally,
the test statistic for the homogeneity hypothesis falls far short of statistical
significance for Teacher 21011 and is highly significant for Teacher 21052. The
estimate of $\sigma^2_\lambda$ for Teacher 21011 is set to zero because the
variance component estimate was negative. (A zero estimate is consistent with
failure to reject the homogeneity hypothesis.) Teacher 21052 exhibits
considerable heterogeneity, with an estimate for $\sigma^2_\lambda$ of 6.95.
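
The homogeneity test statistics in Table 5 can be reproduced with a few lines of code. The sketch assumes that equation A1 has the usual Pearson form for Poisson counts with unequal observation lengths, $\sum_i (X_i - b_i \hat{\lambda})^2 / (b_i \hat{\lambda})$ with $\hat{\lambda} = \sum X_i / \sum b_i$; Appendix A is not reproduced here, so that form is an assumption, although it does return 3.96 and 49.0 for the Table 1 data. The function name is hypothetical.

```python
def poisson_homogeneity_statistic(counts, hours):
    """Pearson-type dispersion statistic for equal Poisson rates.

    Under homogeneity the expected count on occasion i is
    b_i * pooled_rate, with pooled_rate = sum(X_i) / sum(b_i). The
    statistic is referred to a chi-square distribution with T - 1
    degrees of freedom.
    """
    pooled_rate = sum(counts) / sum(hours)
    statistic = sum((x - b * pooled_rate) ** 2 / (b * pooled_rate)
                    for x, b in zip(counts, hours))
    return statistic, len(counts) - 1


# Upper 5% point of the chi-square distribution with 5 degrees of freedom.
CHI2_05_DF5 = 11.07

table_1 = {
    "21011": ([7, 3, 3, 19, 16, 26], [3, 1, 1, 4, 5, 6]),
    "21052": ([7, 10, 5, 8, 4, 40], [1, 3, 2, 4, 6, 5]),
}

for teacher, (counts, hours) in table_1.items():
    stat, df = poisson_homogeneity_statistic(counts, hours)
    verdict = "reject" if stat > CHI2_05_DF5 else "do not reject"
    print(f"Teacher {teacher}: statistic = {stat:.2f} ({df} df), {verdict} homogeneity")
```

The critical value is hard-coded to keep the sketch free of third-party dependencies; a statistics library could supply exact tail probabilities instead.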

In an analysis of a collection of teachers, we apply these statistical 
procedures to the data of each teacher separately. A major purpose of the 
analysis of a collection of teachers is to examine the consistency over time of 
individual teachers in relation to each other, that is, to
investigate individual differences in consistency of behavior over time. For an
example of an analysis of a collection of teachers we used another low-inference
variable from the Texas Junior High School Study, namely, product (lower-order)
questions (i.e., "questions that have a specific correct answer which can be
expressed in a single word or short phrase," Evertson & Veldman, 1981, p. 157).
In Table 6 are results of a heterogeneity-in-counts analysis for 34 English
teachers and, separately, for 22 mathematics teachers, using the behavior-count
data for the product questions variable. 

Table 6

Summary of Heterogeneity-in-Counts Analysis for Two Collections of Teachers

[Stem-and-leaf displays in four columns: the estimated mean and the estimated
variance of the $\lambda$ distribution for the English teachers, and the same
two quantities for the mathematics teachers.]

Note. Data were taken from the Texas Junior High School Study. See
Appendix B for description of these data.

The stem-and-leaf diagrams (Tukey, 1977) in Table 6 show the empirical
distributions of the estimates of $\mu_\lambda$ and $\sigma^2_\lambda$ over each
group of teachers. For each teacher two quantities are displayed, the estimate
of the average (over occasions) hourly rate of product questions
($\hat{\mu}_\lambda$) and the estimate of the variability (over occasions) in
the rate of product questions ($\hat{\sigma}^2_\lambda$). For the mathematics
teachers the largest $\hat{\mu}_\lambda$ is 22.3 and the largest
$\hat{\sigma}^2_\lambda$ is 214, whereas the smallest $\hat{\mu}_\lambda$ is
0.6, and the smallest $\hat{\sigma}^2_\lambda$ is 0.5 (rounded up to 1 in Table
6). Table 6 does not reveal, for example, that the English teacher (teacher
21023) with the largest mean rate of product questions, a $\hat{\mu}_\lambda$
of 24.4, also has the second largest estimated heterogeneity, a
$\hat{\sigma}^2_\lambda$ of 281. The homogeneity hypothesis is rejected (at
level .05) for all of the 22 mathematics teachers and for all but one of the 34
English teachers.

The stem-and-leaf diagrams for $\hat{\mu}_\lambda$ illustrate what has been
noted occasionally in the literature: that teachers differ considerably in their
average rates of behavior for variables such as product questions. The
stem-and-leaf diagrams for $\hat{\sigma}^2_\lambda$ reveal a new aspect of how
teachers differ: teachers may also differ considerably from one another in the
consistency of their rates of behavior.

Statistical Procedures for Bernoulli-Trial Data

Testing homogeneity. For Bernoulli-trial data, the appropriate procedure
is to test the homogeneity (null) hypothesis, $\pi_i = \pi$ for all $i$, against
the general alternative of non-homogeneity, that not all the $\pi_i$ are equal.
A traditional test statistic for this null hypothesis is the binomial "index of
dispersion" (see equation A6 in Appendix A).

Estimating heterogeneity. The variance of the $\pi_i$ distribution,
$\sigma^2_\pi$, represents the heterogeneity in the behavior of an individual
teacher over occasions. (The homogeneity hypothesis states that $\sigma^2_\pi$
is zero.) Various estimates of $\sigma^2_\pi$ have been developed in the
statistical literature. In Appendix A, the estimates developed by Hendricks
(1935), Kleinman (1973), and Robertson (1951) are described.
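
The homogeneity test can be sketched in a few lines of Python. Expression A6
is not reproduced in this section, so the sketch assumes the usual Pearson form
of the binomial index of dispersion, $\sum_i n_i (p_i - \bar{p})^2 /
[\bar{p}(1 - \bar{p})]$, referred to a chi-square distribution with $T - 1$
degrees of freedom. The function name `binomial_dispersion` is hypothetical;
the counts are those listed for Teachers 15 and 11 in Table 8.

```python
def binomial_dispersion(successes, trials):
    """Pearson-type index of dispersion for T binomial proportions.

    Returns (statistic, degrees_of_freedom). Under the homogeneity hypothesis
    the statistic is referred to a chi-square distribution with T - 1 df.
    Assumed form; Expression A6 itself is not shown in this section.
    """
    if len(successes) != len(trials) or len(trials) < 2:
        raise ValueError("need matching lists with at least two occasions")
    total_n = sum(trials)
    p_bar = sum(successes) / total_n  # pooled proportion over all occasions
    # Squared deviations of each occasion's proportion, weighted by trials.
    weighted_ss = sum(
        n * (x / n - p_bar) ** 2 for x, n in zip(successes, trials)
    )
    statistic = weighted_ss / (p_bar * (1.0 - p_bar))
    return statistic, len(trials) - 1


# Lower-order questions and all teacher questions, occasions 3-6 (Table 8).
stat_15, df_15 = binomial_dispersion([5, 7, 15, 16], [56, 33, 45, 28])
stat_11, df_11 = binomial_dispersion([17, 8, 7, 17], [34, 24, 19, 93])
print(f"Teacher 15: {stat_15:.1f} ({df_15})")
print(f"Teacher 11: {stat_11:.1f} ({df_11})")
```

With these counts the statistic comes to about 23.9 and 13.3 on 3 degrees of
freedom, in agreement with the values reported in Table 9; both exceed the .05
critical value of 7.81.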

Trends. The existence of a time dependence, or trend, in the $\pi_i$ is one
important type of violation of the homogeneity hypothesis. A standard test for
linear trend was developed by Armitage (1955).
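
A linear trend in proportions can be sketched as a regression of the occasion
proportions on occasion scores, weighted by the number of trials. The sketch
assumes the common Cochran-Armitage form, with the standard error computed from
the pooled proportion under the null hypothesis; the function name
`linear_trend_in_proportions` and the default scores 1, 2, ..., T (equally
spaced occasions) are illustrative choices, not taken from Armitage (1955).

```python
import math


def linear_trend_in_proportions(successes, trials, scores=None):
    """Weighted least-squares slope of proportions on occasion scores.

    Returns (slope, standard_error, chi_square_1df). The standard error uses
    the pooled proportion, i.e. the null-hypothesis variance (an assumption).
    """
    if scores is None:
        scores = list(range(1, len(trials) + 1))  # equally spaced occasions
    total_n = sum(trials)
    p_bar = sum(successes) / total_n
    t_bar = sum(n * t for n, t in zip(trials, scores)) / total_n
    # Cross-products and sum of squares, each weighted by the trial counts.
    s_xt = sum(x * (t - t_bar) for x, t in zip(successes, scores))
    s_tt = sum(n * (t - t_bar) ** 2 for n, t in zip(trials, scores))
    slope = s_xt / s_tt
    std_error = math.sqrt(p_bar * (1.0 - p_bar) / s_tt)
    return slope, std_error, (slope / std_error) ** 2


# Counts for Teachers 15 and 11 as listed in Table 8.
for label, x, n in [
    ("Teacher 15", [5, 7, 15, 16], [56, 33, 45, 28]),
    ("Teacher 11", [17, 8, 7, 17], [34, 24, 19, 93]),
]:
    slope, se, chi2 = linear_trend_in_proportions(x, n)
    print(f"{label}: slope {slope:.2f}, std. error {se:.3f}")
```

Under these assumptions the slopes and standard errors round to .15 (.031) and
-.10 (.028), the values shown in Table 9.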

Examples. To illustrate the statistical procedures for Bernoulli-trial
data, analyses of the data in Table 2 are presented in Table 7. For each
teacher the empirical proportions of lower-order questions, $p_i = X_i/n_i$, are
given in Table 2. Displayed in Table 7 are a test statistic for the homogeneity
hypothesis (from Expression A6) and estimates of the mean and variance of the
distribution of the $\pi_i$ (from Kleinman, 1973). Teacher 8 shows appreciable
heterogeneity over the four occasions, whereas the data for Teacher 13 are
consistent with the homogeneity hypothesis and the estimate of $\sigma^2_\pi$ is
nearly zero.

Table 7

Heterogeneity-in-Proportions Analysis for the Data from Table 2

                                      Estimated moments of
                                      the $\pi$ distribution
           Homogeneity test
Teacher    statistic^a                Mean        Variance

13         5.06 (3)                   .247        .001
8          58.1 (3)                   .384        .095

^a Numbers in parentheses are the degrees of freedom for the test statistic.

Tables 8 and 9 present analyses of lower-order questions for two additional
teachers observed by Moon. Clearly, neither teacher's behavior is consistent
over time; each teacher shows a statistically significant time trend over the
four occasions of observation. Teacher 15 shows a positive trend over time, and
Teacher 11 shows a negative trend. For estimation of these trends, the four
occasions were treated as equally spaced in time; use of the exact dates of
observation yields similar results. These opposite, significant trends
reinforce our point (see Figure 1) that a violation of absolute invariance does
not necessarily imply haphazard fluctuation.

Table 8

Bernoulli-trial Data: Lower-order Questions

                                    Occasion
Data                         3       4       5       6

Teacher #15
  Lower-order questions      5       7      15      16
  All teacher questions     56      33      45      28
  Empirical proportions    .09     .21     .33     .57

Teacher #11
  Lower-order questions     17       8       7      17
  All teacher questions     34      24      19      93
  Empirical proportions    .50     .33     .37     .18

Note. Data were taken from Moon's elementary-school science study. See
Appendix B for description of these data.

Table 9

Trend and Heterogeneity-in-Proportions Analysis for Two Teachers

                                 Estimated moments of        Analysis of
                                 the $\pi$ distribution      linear trend
           Homogeneity test                                  Est.       Std.
Teacher    statistic^a           Mean        Variance        slope      error

15         23.9 (3)              .296        .0265            .15       .031
11         13.3 (3)              .336        .0076           -.10       .028

^a Numbers in parentheses are the degrees of freedom for the test statistic.

Table 10 presents analyses of Bernoulli-trial data for two collections of
teachers from investigations by Moon and by Trinchero (see Appendix B). A
question asked by the teacher constitutes a trial and a lower-order question
counts as a success. Considerable individual differences in consistency exist
in both collections of teachers. The estimates of $\sigma^2_\pi$ range between
0.0001 and 0.09^7 for the Moon data and between 0.0 and 0.0824 for the Trinchero
data. Also, Table 10 reveals that the estimates of $\sigma^2_\pi$ for the
teachers in the Moon study differ considerably from the estimates of
$\sigma^2_\pi$ for teachers in the Trinchero study. The homogeneity hypothesis
will not be rejected for teachers whose estimate of $\sigma^2_\pi$ is very close
to zero. For the Moon data, the homogeneity hypothesis is rejected (at level
.05) for 13 of the 16 teachers. For the Trinchero data, the homogeneity
hypothesis is rejected (at level .05) for only 5 of the 22 teachers.

Table 10

Summary of Heterogeneity-in-Proportions Analysis for Two Collections of Teachers

                  Moon                                      Trinchero

Stem   Estimated mean      Estimated variance    Estimated mean      Estimated variance
       of the $\pi$        of the $\pi$          of the $\pi$        of the $\pi$
       distribution^a      distribution^b        distribution^a      distribution^b

10 
9 
8 
7 
6 
5 
4 
3 
2 
1 
0 



49 

0112468 

257779 

5 



9 
2 

177 

22446 

1388 



011

11225799 
023577 
2668 
4 



2 

13 



5 

37 



000000000000012 



^a Multiply stem.leaf by 0.1.
^b Multiply stem.leaf by 0.01.



The Trinchero data provide an excellent example of the consequences of
considering the target behavior (lower-order questions) as Bernoulli-trial data
rather than as behavior-count data. In contrast to the analysis of the
Bernoulli-trial data (for which the homogeneity hypothesis is rejected for 5 of
22 teachers), analysis of only the behavior-counts for lower-order questions
leads to rejection (at level .05) of the homogeneity hypothesis for 19 of the 22
student teachers! These Bernoulli-trial results provide empirical support for
the contention that teachers may be consistent in their questioning behavior, a
finding that would not be obtained if the behavior-count representation were
used.

Statistical Procedures for Quantitative Measures 

Testing homogeneity. A direct test of the homogeneity hypothesis for
quantitative measures (all the $\mu_i$ are equal) is more difficult to obtain
than the statistical tests for the homogeneity hypotheses with behavior-count
data or Bernoulli-trial data. Recall that we adopted a model for quantitative
measures which states that $X_i$ is drawn from a Gaussian distribution with mean
$\mu_i$ and variance $\sigma_i^2$. Because in the Gaussian distribution the mean
and variance are unrelated, a determination of whether the $X_i$ are more spread
out than expected (under the homogeneity hypothesis) is impossible without
additional information on the $\sigma_i^2$. (By contrast, for the Poisson model
the mean of $X_i$ is $b\lambda_i$, which is also the variance, and for the
binomial model the mean is $n_i\pi_i$ and the variance is
$n_i\pi_i(1 - \pi_i)$.)



Consider that the $\mu_i$ have a distribution with mean $\theta$ and variance
$\kappa^2$. The homogeneity hypothesis states that $\kappa^2 = 0$. If the
distribution of the $\mu_i$ is Gaussian, then the compound distribution of the
$X_i$ is Gaussian (see Johnson & Kotz, 1970, Section 13.7.2) with mean $\theta$
and variance $\sigma^2 + \kappa^2$ (setting $\sigma_i^2 = \sigma^2$). Thus, if
$\sigma^2$ were known, then a statistical test of the homogeneity hypothesis
could be conducted by testing the hypothesis that $\sigma_X^2 = \sigma^2$
against the alternative that $\sigma_X^2 > \sigma^2$ (i.e., $\kappa^2 > 0$)
using standard chi-square methods. In some cases the most useful assumption may
be that $\sigma_i^2 = 0$ for all $i$.

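The chi-square test with a known within-occasion variance can be sketched as
follows. Under the homogeneity hypothesis, $\sum_i (X_i - \bar{X})^2 /
\sigma^2$ follows a chi-square distribution with $T - 1$ degrees of freedom.
The function names, the example scores, and the value $\sigma^2 = 4$ are
hypothetical; only the tabulated .05 critical values are standard.

```python
# Upper .05 critical values of the chi-square distribution, by df.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def gaussian_homogeneity_statistic(x, sigma2):
    """Return (statistic, df) for testing equal occasion means.

    x      : one teacher's scores on T occasions
    sigma2 : within-occasion variance, assumed known and common to occasions
    """
    if sigma2 <= 0:
        raise ValueError("sigma2 must be positive")
    t = len(x)
    mean = sum(x) / t
    sum_sq = sum((xi - mean) ** 2 for xi in x)
    return sum_sq / sigma2, t - 1


def rejects_homogeneity(x, sigma2):
    """One-sided test at level .05: is the spread larger than sigma2 allows?"""
    statistic, df = gaussian_homogeneity_statistic(x, sigma2)
    return statistic > CHI2_CRIT_05[df]


# Hypothetical scores for one teacher on four occasions.
scores = [2.0, 4.0, 6.0, 8.0]
print(gaussian_homogeneity_statistic(scores, sigma2=4.0))
print(rejects_homogeneity(scores, sigma2=4.0))
```

The sketch makes the dependence on $\sigma^2$ explicit: the same four scores
lead to rejection when $\sigma^2$ is taken to be 1 and to no rejection when it
is taken to be 4.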
Indices of stability. A natural measure of heterogeneity would be an
estimate of $\kappa^2$. A number of alternative quantities may be useful in
situations where an estimate of $\kappa^2$ is not available. Certainly, the
sample variance or standard deviation of the $X_i$ for each teacher permits some
comparison of the consistency over time of the behavior of different teachers.
For example, in her analysis of children's cognitive development, Bayley (1949)
computed the standard deviation of each child's IQ score over multiple testings.
This standard deviation was termed a "Lability Score." In addition, children
were characterized as "labile" or "stable" if they were in the upper or lower
quartiles, respectively, of the empirical distribution over all children of
these standard deviations. Similarly, Flanders (1969) used the standard
deviation of the indirectness-directness (I/D) ratio across situations and
occasions to obtain an "index of flexibility" for each of 20 teachers.

Measures related to the standard deviation, such as the coefficient of
variation, Gini's mean difference, and the coefficient of concentration (see
Kendall & Stuart, 1969), provide similar descriptive information on the
consistency over time of each teacher. Alternative indices can be adapted from
the early statistical studies of stability of a statistical series (see Forsyth,
1932, 1937; Bortkiewicz, 1931), which are derived from Lexis theory, an active
topic in late-nineteenth-century statistical work.
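
The descriptive indices named above can be computed directly from one
teacher's scores. The sketch below uses common textbook definitions, which are
assumptions here: the sample standard deviation with an $n - 1$ divisor, Gini's
mean difference as the average absolute difference over all distinct pairs, and
the coefficient of concentration as that mean difference divided by twice the
mean. The function name and the scores are hypothetical.

```python
import statistics
from itertools import combinations


def consistency_indices(x):
    """Descriptive indices of consistency for one teacher's scores."""
    mean = statistics.mean(x)
    sd = statistics.stdev(x)  # sample standard deviation (n - 1 divisor)
    pairs = list(combinations(x, 2))
    # Gini's mean difference: mean absolute difference over distinct pairs.
    gini_md = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return {
        "sd": sd,
        "coefficient_of_variation": sd / mean,  # needs a nonzero mean
        "gini_mean_difference": gini_md,
        "coefficient_of_concentration": gini_md / (2.0 * mean),
    }


# Hypothetical scores for one teacher on four occasions.
for name, value in consistency_indices([1.0, 2.0, 3.0, 4.0]).items():
    print(f"{name}: {value:.3f}")
```

The ratio-based indices are meaningful only for measures with a true zero and
a positive mean, which is one reason the plain standard deviation is often the
safer default.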

Trends. The simple $X$ on $t$ regression function for an individual teacher
can be used to detect time trends for quantitative measures. The product-moment
correlation $r_{Xt}$ indicates the strength of the linear time trend for the
quantitative measure. With many observations on each teacher, more complex time
dependencies can be investigated by fitting a polynomial or a non-linear
function of time.
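
A minimal sketch of the straight-line fit and the accompanying correlation for
a single teacher is given below. The function name `time_trend`, the default
of equally spaced occasions, and the example scores are hypothetical.

```python
import math


def time_trend(x, t=None):
    """Least-squares line of X on t for one teacher.

    Returns (slope, intercept, r), where r is the product-moment correlation
    between the scores and the occasion times.
    """
    if t is None:
        t = list(range(1, len(x) + 1))  # equally spaced occasions (assumed)
    n = len(x)
    x_bar = sum(x) / n
    t_bar = sum(t) / n
    s_tt = sum((ti - t_bar) ** 2 for ti in t)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xt = sum((xi - x_bar) * (ti - t_bar) for xi, ti in zip(x, t))
    if s_tt == 0 or s_xx == 0:
        raise ValueError("correlation undefined when x or t is constant")
    slope = s_xt / s_tt
    intercept = x_bar - slope * t_bar
    r = s_xt / math.sqrt(s_tt * s_xx)
    return slope, intercept, r


# Hypothetical scores for one teacher on four occasions.
print(time_trend([1.0, 3.0, 2.0, 4.0]))
```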

Examples. The data in Tables 3 and 4 are used to provide examples of the
examination of consistency over time for individual teachers using descriptive
statistics for the quantitative measures. Teacher 6 from Table 3 has a standard
deviation for the indirectness-directness (I/D) ratio of .21, whereas Teacher 1
has a standard deviation of 2.46. For the 16 teachers in Moon's study, the
standard deviation of the indirectness-directness (I/D) ratio ranges between .15
and 2.46, with a median of .74. Relative to the other teachers observed in this
study, Teacher 6 appears consistent over time, whereas Teacher 1 has by far the
least consistent behavior. Neither of these teachers shows a time trend for the
indirectness-directness (I/D) ratio; the magnitude of $r_{Xt}$ is less than .10
for both teachers.

In Table 4 are ratings of positive affect from two classes taught by the
same junior-high-school English teacher. Notable in these data is that $r_{Xt}$
is -.96 and -.94 for the two classes respectively, the strongest associations
seen in the 76 English classes in the Texas study. Naturally, both of this
teacher's classes show highly negative and statistically significant time
trends.

Consistency of Individual Differences 
Question B, concerning the consistency or maintenance of individual
differences over time, has been pre-eminent in empirical studies of stability.
One statement of interest in Question B is seen in Shavelson and Dempsey-Atwood
(1976): "Although it is possible to consistently rank order teacher performance
at one point in time it is an empirical question as to whether this rank
ordering is stable" (p. 554). Our purposes in this section are to state
explicitly what is meant by the consistency of individual differences over time
and to present statistical procedures for assessing the degree of consistency of
individual differences.

Representing Consistency of Individual Differences 

The consistency of individual differences over time is a property of the 
collection of individual time paths for the target behavior. The individual 
time path is a representation of each teacher's behavior as a systematic 
function over time. (For convenience, this exposition will focus on functions
of quantitative measures over time.) Note that a time dependence of the behavior
of an individual teacher constitutes a violation of the relevant homogeneity
hypothesis (see Figure 1).

Figure 2 displays two examples of (perfectly) consistent individual 
differences over time. In Figure 2a the time path of the target behavior for 
each teacher is a straight line. Wohlwill (1973, Figure 12-7) uses a similar 
diagram to represent the "Preservation of individual differences." In Figure 2b 
individual differences are maintained in a collection of curvilinear time paths. 

Three criteria can be used to define perfectly consistent individual
differences over time: (a) absolute vertical distance between pairs of time
paths unchanged over time (i.e., all time paths parallel), (b) relative distance
between time paths unchanged over time (i.e., percentile rank of points on each
time path constant over time), and (c) rank order of time paths maintained over
time (i.e., time paths do not intersect). Criterion (a) is more strict than
criterion (b), which is, in turn, more strict than criterion (c). Both frames
in Figure 2 show consistency under all three criteria.

The biometric literature provides methods that can be applied to the study
of individual differences among teachers. In this literature, the consistency of
individual differences over time is described by the term "tracking." A major
substantive concern in medical research is whether blood pressure tracks, that
is, whether children with relatively high blood pressure compared to a
representative group of children maintain that position over time and become
high-risk adults. The empirical question of whether individual differences are
preserved over time is precisely the question that has been dominant in research
on teaching.

Statistical Procedures — Indices of Tracking 

Perfect tracking of teacher behavior is probably rare, even under the least 
restrictive criterion that rank order be preserved over time. The degree of 
consistency of individual differences over time can be described by an index of
tracking. Use of an index of tracking provides important advantages over the 
correlational analyses common in research on teaching. First, an index of 
tracking allows assessment of the consistency of individual differences over 
more than two time points. Second, an index of tracking incorporates explicit 
statistical models for the individual time paths and thus is applicable when 
time trends in behavior are present. 

Foulkes-Davis $\gamma$. Foulkes and Davis (1981) propose an index of tracking,
denoted by $\gamma$, which reflects "the maintenance over time of relative
ranking within the response distribution" (p. 439). Thus perfect tracking occurs
when a collection of individual time paths do not intersect in a specified time
interval. The index of tracking is defined as the probability that two randomly
chosen time paths do not intersect during a specified time interval. No
tracking is said to occur if $\gamma$ < .50. To estimate $\gamma$, polynomial
time trends are fitted to the data for each individual; the estimate of $\gamma$
is then determined from the pairwise intersections of these fitted time
trends.^7 Note that, when measurements are available at only two points in time,
$\gamma$ reduces to an estimate of Kendall's probability of concordance;
consequently, $\gamma$ may be thought of as a generalization of a rank
correlation coefficient (Foulkes & Davis, 1981, Section 2).
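
The definition of $\gamma$ suggests a simple counting estimate, sketched below
for the special case of fitted straight lines: the proportion of pairs of
fitted time paths that do not cross within the observation interval. This is a
simplified illustration and not the Foulkes and Davis (1981) estimator with its
standard error; the function names and the three example lines are
hypothetical.

```python
from itertools import combinations


def lines_cross(line_a, line_b, t_start, t_end):
    """True if two straight lines cross strictly inside (t_start, t_end).

    Each line is an (intercept, slope) pair. The difference between two
    straight lines is itself linear in t, so a sign change between the two
    endpoints is equivalent to a crossing inside the interval. Lines that
    merely touch at an endpoint are counted as not crossing.
    """
    (a0, a1), (b0, b1) = line_a, line_b
    diff_start = (a0 + a1 * t_start) - (b0 + b1 * t_start)
    diff_end = (a0 + a1 * t_end) - (b0 + b1 * t_end)
    return diff_start * diff_end < 0


def tracking_index(lines, t_start, t_end):
    """Proportion of pairs of fitted lines that do not cross in the interval."""
    pairs = list(combinations(lines, 2))
    if not pairs:
        raise ValueError("need at least two time paths")
    non_crossing = sum(
        1 for a, b in pairs if not lines_cross(a, b, t_start, t_end)
    )
    return non_crossing / len(pairs)


# Three hypothetical fitted lines, given as (intercept, slope).
fitted = [(0.0, 1.0), (1.0, 1.0), (3.0, -1.0)]
print(tracking_index(fitted, t_start=0.0, t_end=4.0))
```

In the example the two parallel lines never cross, but each of them crosses the
descending line, so one pair in three is non-crossing. For quadratic or cubic
fits the endpoint check is no longer sufficient, and the roots of the
difference polynomial inside the interval would have to be examined.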

McMahan $\tau$. An alternative index of tracking appears in McMahan (1981).
McMahan defines tracking as follows: "For each individual, the expected value
of the relative deviation from the population mean remains unchanged over time"
(p. 449), a definition closely related to the maintenance of percentile rank
over time. The index of tracking represents the degree to which this definition
is satisfied by the data. Specifically, $\tau$ represents the variance in $X$
(corrected for within-subject error) explained by the individual's relative
deviation from the population mean.^8 Unfortunately, in behavioral data, the
correction for within-subject error may often overstate the real measurement
error variance, making $\tau$ less attractive than $\gamma$ for analyses of
teacher behavior. That is, the correction for errors of measurement in the
estimation of $\tau$ is model dependent, and unless the correct model for the
time trend in the target behavior is fitted, this index of tracking is likely to
be seriously inflated. Comparisons of $\tau$ and $\gamma$ for various data sets
are provided by Rogosa and Willett (1983).

^7 The estimate of the index of tracking is model dependent in the sense
that different degrees of tracking will be seen when different functional forms
for the individual time paths are fitted (e.g., quadratic vs. cubic fits).

^8 For measurements at only two points in time, the index $\tau$ is a
product-moment correlation, corrected for attenuation. In the special case in
which the variance of the measure is the same at each of multiple points in
time, the index is simply the average of the pairwise, disattenuated correlation
coefficients.

Example. Using data on product (lower-order) questions for each of six
months during the school year, the consistency of individual differences for
teachers in the Texas Junior High School Study (Evertson & Veldman, 1981) was
assessed. First, the observed rate of product questions on each occasion was
transformed to $\sqrt{\text{rate} + 3/8}$. (This transformation is an effective
normalizing and variance-stabilizing transformation for Poisson variates.) Then,
for each teacher a straight-line time trend was fitted to the transformed rate
of product questions. The collection of these fits for the 25 English teachers
is displayed in Figure 3. Although these fits have a considerable number of
intersections, the figure does show that the fitted trends for some teachers
remain consistently high, while the trends for others are consistently low. For
these data, the estimate of Foulkes-Davis $\gamma$ is .71 with an approximate
standard error of .02; this estimate of $\gamma$ reflects reasonably strong
consistency of individual differences among teachers for product questions.

Reconsidering Previous Approaches and Methods 
In the research-on-teaching literature three approaches to assessing
stability have dominated work on stability of teacher behavior: (a) computation
of correlations among observation times (e.g., Brophy, Coulter, Crawford,
Evertson, & King, 1975), (b) application of generalizability theory (e.g.,
Erlich & Shavelson, 1978), and (c) estimation of occasion effects in
repeated-measures analysis of variance (e.g., Evertson & Veldman, 1981). None of
these approaches explicitly addresses Question A. Furthermore, the information
that these approaches provide on Question B is limited and sometimes misleading.

Each of the three approaches uses the same basic data. We denote these
data by $X_{ij}$, an individual datum being the measurement obtained for Teacher
$j$ ($j = 1, \ldots, n$) on Occasion $i$ ($i = 1, \ldots, T$). The different
approaches employ different statistical models for the $X_{ij}$, and
consequently in each approach different summaries of the $X_{ij}$ are sought.
Below we describe each approach and give examples from empirical research. We
are particularly interested in what can or cannot be learned from the
statistical methods used in these approaches. Often a considerable gap, or even
a contradiction, exists between the stated research question and the statistical
methods employed to assess stability.

Figure 3. A collection of straight-line time trends in rate of product
questions for 23 English teachers from the Texas Junior High School Study.

Correlations Among Observation Times 

The correlational approach, by far the most common approach to assessing
stability of behavior, is seen as an extension of test-retest reliability. Most
often, studies using correlations over observation occasions to assess temporal
stability use only two occasions of observation and report time-one, time-two
correlations as measures of stability. Some studies having observations on more
than two occasions have adapted the test-retest correlation approach to multiple
time points. Common to all these studies is the willingness of investigators to
correlate most anything; whether the data be behavior-counts, proportions,
rates, complex derived indices (such as ratios), or high-inference ratings, the
product-moment correlation is used with little concern for the distributional
assumptions in statistical inference,^9 for the adequacy of a measure of linear
association, or for the usefulness of data transformations.

^9 Although the use of the product-moment correlation as a descriptive
statistic does not explicitly depend on any assumptions about the bivariate
distribution of the time-one, time-two data, researchers often report, in
addition to the correlation coefficient, the results of statistical inference
procedures (e.g., p-values for the null hypothesis of zero correlation) that do
depend crucially on the validity of the underlying assumption of a bivariate
normal distribution. In addition, it seems appropriate to remark that rejection
of the null hypothesis of a zero time-one, time-two correlation (say, at level
.03) is very weak evidence of stability. However, such a criterion, used
formally or informally, is widespread in the literature.

Time-one, time-two correlations. The time-one, time-two correlation, or
test-retest correlation, is the measure of stability of teacher behavior used in
almost all empirical research. The data consist of measurements on the target
behavior on two occasions for each of n teachers. The stability measure is the
usual product-moment correlation between the measurements at time one and the
measurements on the same teachers for time two, written as $r_{X_1X_2}$.

In their influential chapter, Medley and Mitzel (1963) define the
"stability coefficient" to be "a correlation between scores based on
observations made by the same observer at different times" (p. 254). They add
that "the coefficient of stability tells us something about the consistency of
the behavior from time to time" (p. 254). To amend Medley and Mitzel's
interpretation of the "stability coefficient," $r_{X_1X_2}$ only tells us
something about the consistency of individual differences in behavior from time
to time. That is, the correlation provides information only on Question B.

Applications. A typical example of the correlational approach is seen in
Brophy, Coulter, Crawford, Evertson, and King (1975). They report analyses of
data obtained by observing 19 second- and third-grade teachers using the
Classroom Observation Scales. Teachers were observed for two mornings and two
afternoons in the first year of the study and on 14 occasions in the second year
of the study. The observations for multiple time periods were reduced to two
measurements by computing each teacher's mean score for each year. For a number
of target behaviors, correlations were computed between these mean scores over
the two years.

More extensive examples can be found in the Shavelson and Dempsey-Atwood
(1976) compendium of time-one, time-two correlations. Among the teaching
behaviors included in the Shavelson and Dempsey-Atwood review are various kinds
of teacher questions, for which they conclude: "the data from seven studies
indicate that teacher questioning behavior is unstable" (1976, p. 605). Among
the many studies reviewed are those by Moon and Trinchero, the data from which
have been used in this paper. Interestingly, Shavelson and Dempsey-Atwood
(1976, Table 1), using two occasions from the Trinchero data, report "stability
coefficients" of 0.17 for both lower-order and higher-order teacher questions.
Another example of the use of time-one, time-two correlations is provided by the
data on "conventional teachers" in Moon's study (1969, 1971) (see Appendix B),
for which Shavelson and Dempsey-Atwood report "stability coefficients" of 0.42
for lower-order (factual) questions and 0.64 for the indirectness-directness
(I/D) ratio.

Multiple occasions. Similar correlational strategies for addressing
Question B have sometimes been adopted when data from more than two observation
occasions are available. The assumption that the test-retest correlation is the
same for all pairs of time points is the basis for the use of an average
correlation over all pairs of time points as a "stability coefficient." The
multiple occasions are employed merely to replicate the test-retest correlation,
and the $T(T-1)/2$ correlations among all time points (the $r_{X_iX_{i'}}$) are
averaged arithmetically or, more appropriately, by using Fisher's
z-transformation. Alternatively, an intraclass correlation coefficient, based on
the model stated in Ebel (1951), can be used in place of the average
correlation.
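
The averaging of the $T(T-1)/2$ pairwise correlations can be sketched as
follows. The Fisher average transforms each correlation with the inverse
hyperbolic tangent, averages on that scale, and transforms back. The function
names and the example values are hypothetical, and the sketch weights all pairs
equally.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def pairwise_correlations(occasions):
    """occasions[i] holds the n teachers' scores on occasion i."""
    return [pearson_r(a, b) for a, b in combinations(occasions, 2)]


def fisher_average(correlations):
    """Average correlations on Fisher's z scale (fails if any |r| equals 1)."""
    z_values = [math.atanh(r) for r in correlations]
    return math.tanh(sum(z_values) / len(z_values))


# Two hypothetical correlations: arithmetic mean .40, Fisher average .50.
rs = [0.0, 0.8]
print(sum(rs) / len(rs), fisher_average(rs))
```

Neither average uses the time ordering of the occasions: permuting the
occasions leaves the set of pairwise correlations, and hence the "stability
coefficient," unchanged.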

Most importantly, the time ordering of the observations is ignored in this
averaging of correlations to form a stability coefficient, and thus time trends
in the behavior cannot be accommodated. With multiple time points, if time
trends in the target behavior exist, then the indices of tracking are far
superior to test-retest correlations for addressing Question B.

An example of the correlational approach with data from multiple occasions
is the reanalysis of the teacher questioning data for the final four occasions
from Moon's study of elementary science teachers which is presented in
Rosenshine (1973, Table 2). He reports both an average correlation (using
Fisher's z-transformation) and an intraclass correlation for each of the five
types of teacher questions. As each of the five 4x4 correlation matrices
includes both positive and negative elements of moderate magnitude, the
intraclass correlation yields "stability coefficients" of zero (or even negative
coefficients if the average correlation is used). Other examples of this
approach to the analysis of multiple-time-point data are found in Shavelson and
Dempsey-Atwood (1976).

Generalizability Theory 

Generalizability theory provides a second approach to the assessment of 
stability of teacher behavior. The data for generalizability analyses consist 
of two or more observations of a target behavior on each of n teachers, the 
observations being made by k raters. For simplicity, we will consider only the 
case of $k = 1$ rater, and thus the data are the $X_{ij}$ as in the previous section.
(The assumption k = 1 is an extreme simplification of generalizability-theory 
methods, but this restriction is useful for our exposition of applications to 
stability.) The measure of stability is the coefficient of generalizability. 

Generalizability theory is an extension of classical test theory which
features the separate estimation of several sources of variation in the
observations. Applications of generalizability theory to the stability of
teacher behavior have been advocated by Erlich and Borich (1979), McGaw, Wardrop
and Bunda (1972), and Shavelson and associates (Erlich & Shavelson, 1978;
Shavelson & Dempsey-Atwood, 1976; Shavelson & Webb, 1981). A useful
interpretation of the generalizability coefficient is that it indicates the
ability to detect differences among teachers' average behavior; that is, the
generalizability coefficient answers the question, "How well can the average
behavior of each teacher be located relative to other teachers' average
behavior?" As discussed earlier in this paper, the ability to differentiate
teachers on their (average) behavior is crucial for the success of
process-product research.

A major limitation of generalizability theory for assessing temporal
stability is that generalizability theory ignores time trends in behavior by
focusing on the time average of each teacher, the $\bar{X}_{\cdot j}$. (All
variation of the $X_{ij}$ about $\bar{X}_{\cdot j}$ is construed as "error.")
Specifically, generalizability theory assumes a steady state for the behavior
over time of each individual teacher: "Because our model treats conditions
within a facet as unordered, it will not deal adequately with the stability of
scores that are subject to trends" (Cronbach, Gleser, Nanda, & Rajaratnam, 1972,
p. 364). A consequence of treating the time facet as unordered is that it must
be assumed that all individuals are consistent over time; that is, Question A is
answered affirmatively for each individual (see also Ebel, 1951, p. 409).

An analysis of variance model for generalizability-theory analysis (with
one rater) is as follows:

$$X_{ij} = \mu + \beta_j + \epsilon_{ij} \qquad (i = 1, \ldots, T;\; j = 1, \ldots, n),$$

where $\beta_j$ represents the teacher effect. In terms of this one-way,
random-effects model the question addressed by the use of generalizability
theory can be expressed precisely as, "Is $\sigma^2_\beta$ (the variance
component for teachers) big compared to the error?" The coefficient of
generalizability for this model,
$\sigma^2_\beta / (\sigma^2_\beta + \sigma^2_\epsilon / T)$, measures the
ability to distinguish among teachers' mean behavior over hypothetical
replications of the measurement. The proper interpretation of this coefficient
in the assessment of stability of behavior is that the coefficient of
generalizability addresses Question B in a peculiar, conditional manner.
Specifically, conditional on the steady state assumption, the coefficient of
generalizability assesses individual differences among teachers. Consequently,
the assumption that all individuals are consistent over time is crucial to the
interpretation of the generalizability coefficient as a measure of stability of
individual differences.^10
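
The coefficient for the one-way model can be sketched with the usual
method-of-moments estimates of the variance components from a balanced design.
The estimator shown (error variance from the within-teacher mean square,
teacher variance from the difference of mean squares divided by $T$) is the
standard one-way random-effects computation and is an assumption here, as are
the function name and the example scores.

```python
def generalizability_coefficient(scores):
    """One-way random-effects estimates for a balanced teacher-by-occasion table.

    scores[j][i] is the measurement for teacher j on occasion i.
    Returns (var_teacher, var_error, coefficient).
    """
    n = len(scores)
    t = len(scores[0])
    if n < 2 or t < 2 or any(len(row) != t for row in scores):
        raise ValueError("need a balanced table with n >= 2 and T >= 2")
    teacher_means = [sum(row) / t for row in scores]
    grand_mean = sum(teacher_means) / n
    ms_between = t * sum((m - grand_mean) ** 2 for m in teacher_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, teacher_means) for x in row
    ) / (n * (t - 1))
    var_error = ms_within
    # Negative estimates of the teacher component are truncated at zero.
    var_teacher = max((ms_between - ms_within) / t, 0.0)
    denominator = var_teacher + var_error / t
    coefficient = var_teacher / denominator if denominator > 0 else float("nan")
    return var_teacher, var_error, coefficient


# Hypothetical scores: three teachers, three occasions, a shared upward trend.
table = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(generalizability_coefficient(table))
```

In the example every teacher rises by one unit per occasion. The one-way model
has no term for that shared trend, so all of the within-teacher variation is
counted as error, which illustrates how a systematic time dependence is
absorbed into the error variance under the steady state assumption.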

In some applications of generalizability theory to analysis of teacher
behaviors, a two-way analysis of variance model has been used:

$$X_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij} \qquad (i = 1, \ldots, T;\; j = 1, \ldots, n).$$

The inclusion of the occasion effect, $\alpha_i$, in this model is contrary to
the statement of the steady state hypothesis. Although the occasion effect
allows for trends in individual behavior, this model constrains all individual
teachers to have the same time trend. The inclusion of the occasion effect in
the model serves to reduce the error variance by removing the average trend over
all teachers from the deviations of $X_{ij}$ about $\bar{X}_{\cdot j}$ for each
teacher.^11

^10 The assumptions underlying the use of the generalizability coefficient
can be seen from the formulation of Ebel (1951) in which the observations "may
be considered to consist of a true component and an error. The true component
is constant in all [T] estimates for any one person but varies from person to
person" (p. 409). That is, in Ebel's model which underlies the use of the
intraclass correlation, all individuals are assumed to be consistent over time
(on true score). The generalizability coefficient (with k = 1) is simply Ebel's
"reliability of average rating," which can be obtained by applying the
Spearman-Brown formula to the intraclass correlation (see also Haggard, 1958,
pp. 89, 134). (To link Ebel's formulation with the present discussion, occasions
assume the role of raters.)

Applications of generalizability theory to the analysis of classroom
observation data can be found in Cronbach et al. (1972, Chap. 7); among their
examples is a reanalysis of the data from Medley and Mitzel (1963) on 24
teachers for five occasions with two raters (see Table 7.1, p. 191). An
additional application of generalizability theory to the analysis of teacher
behavior is Erlich and Shavelson (1978), in which observations on five teachers
in both reading and math lessons on three occasions (from a sub-study of BTES,
Phase II, Sandoval, no date) were reanalyzed. Also, Erlich and Borich (1979),
using data from five occasions on second- and third-grade teachers and the
two-way analysis of variance model stated above, found that only 35 of the 167
classroom variables from the Teacher-Child Dyadic Interaction System were
generalizable, the criterion being that "a generalizable variable was defined in
this study as one for which a coefficient of generalizability of at least 0.7
could be obtained by observing the teacher on 10 or fewer occasions" (Erlich &
Borich, 1979, p. 13). In contrast, Lomax (1982), using observations of student
behavior during reading instruction from 11 elementary-level learning disability
classrooms, obtained an "average stability coefficient" of .975 for 30 hour-long
observation periods.

^11 The distinction between the two analysis-of-variance models for the
$X_{ij}$, in terms of their consequences for the generalizability coefficient,
is identical to the considerations in Ebel (1951, p. 411) in his discussion of
"removing between-raters [occasions] variance from the error term" (see Ebel,
pp. 410-411, and Haggard, 1958, p. 89, for additional discussion). In some
presentations of generalizability theory, this distinction is couched in the
terminology of "relative decisions" versus "absolute decisions" (e.g., Shavelson
& Webb, 1981, Section 1), and the generalizability coefficients for the one-way
and two-way analysis of variance models are given in Equations 10 and 9,
respectively, of Shavelson and Webb (1981).

Mean Teacher Time Path 

A third approach to assessing stability of teacher behavior is the use of
repeated-measures analysis of variance to investigate time trends in the mean
over all teachers. The indication of stability in this approach is the lack of
an occasion effect in repeated-measures analysis of variance. Thus, the
question being addressed can be expressed as, "Is the mean behavior across
teachers absolutely invariant over time?"

A way to relate this approach to the questions about stability is to
consider an assumption that the time trend in the occasion means
$\bar{X}_{i\cdot}$ accurately represents the behavior of a typical teacher over
time. Invoking this assumption, the absence of an occasion effect can be
considered an affirmative answer to Question A for every teacher studied.

The data used in this approach to assessing stability are two or more
observations on each teacher. The repeated-measures analysis of variance model
for a single group of teachers can be written:

$$X_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij} \qquad (i = 1, \ldots, T;\ j = 1, \ldots, n).$$

The existence of an occasion effect, evidenced by a rejection of the null
hypothesis that all $\alpha_i$ are equal, would indicate lack of stability.
Although the model for the $X_{ij}$ is the same as has been used in
applications of generalizability theory, interest centers on a different
parameter. The concern here is with the $\alpha_i$, whereas in generalizability
theory the spread of the $\beta_j$ indicates the stability of the behavior.

An application of repeated-measures analysis of variance to assessing
stability of teacher behavior is reported by Evertson and Veldman (1981) in
their analyses of teacher-behavior data from mathematics and English classes in
the Texas Junior High School Study. (We use these data extensively in this
paper.) The data analyzed by Evertson and Veldman are observations of behavior
on six occasions, where the data for each occasion are an average of behavior
observed within a month. Another example of a repeated-measures analysis of
variance is that of Good, Cooper, and Blakey (1980), who examined teacher-
student interaction over time.
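
The occasion-effect test described in this section can be sketched in a few
lines. This is only an illustration under stated assumptions: a complete
occasions-by-teachers table with no missing cells, the two-way model written
above, and made-up scores. The function name `occasion_effect_f` and the data
are hypothetical and do not come from any of the studies cited. The sketch also
shows that the two-way error term equals the one-way (within-teacher) error term
minus the occasion sum of squares.

```python
def occasion_effect_f(x):
    """x[i][j] is the behavior of teacher j on occasion i (no missing cells)."""
    T, n = len(x), len(x[0])
    grand = sum(map(sum, x)) / (T * n)
    occasion_means = [sum(row) / n for row in x]
    teacher_means = [sum(x[i][j] for i in range(T)) / T for j in range(n)]

    # Sum of squares for occasions: spread of the occasion means.
    ss_occasion = n * sum((m - grand) ** 2 for m in occasion_means)
    # One-way error term: deviations of X_ij about each teacher's own mean.
    ss_within_teacher = sum(
        (x[i][j] - teacher_means[j]) ** 2 for i in range(T) for j in range(n)
    )
    # Two-way error term: what remains once the average trend is removed.
    ss_residual = ss_within_teacher - ss_occasion

    df_occasion, df_residual = T - 1, (T - 1) * (n - 1)
    f_stat = (ss_occasion / df_occasion) / (ss_residual / df_residual)
    return {
        "ss_occasion": ss_occasion,
        "ss_within_teacher": ss_within_teacher,
        "ss_residual": ss_residual,
        "df": (df_occasion, df_residual),
        "f": f_stat,
    }


# Hypothetical scores: 3 occasions (rows) by 4 teachers (columns).
scores = [[4, 7, 5, 9], [6, 8, 6, 11], [5, 9, 7, 10]]
result = occasion_effect_f(scores)
print(result)
```

The F statistic would then be referred to an F distribution with the returned
degrees of freedom; a large value indicates an occasion effect and, under this
approach, a lack of stability.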

Comparisons and Contradictions 

Each of these three approaches to assessing stability of teacher behavior
incorporates a different definition (rarely explicit) of stability. These
differences in definition can be seen by considering the questions addressed by
each approach. Test-retest correlations address the question, "Are individual
differences among teachers maintained from Time 1 to Time 2?" Applications of
generalizability theory address the question, "How well can the average behavior
of each teacher be located relative to other teachers' average behavior?"
Repeated-measures analysis of variance addresses the question, "Is the mean
behavior across teachers absolutely invariant over time?" Though all three
techniques have been used to assess "stability," the definition of stability
(and thus the quantity being estimated) differs markedly among these three
approaches.
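
A small constructed example may help show how far apart these definitions can
be. The numbers below are invented purely for illustration (four hypothetical
teachers observed on two occasions); they are not data from any study cited
here. The mean over teachers is identical on both occasions, so there is no
occasion effect, yet the ordering of teachers is exactly reversed, so the
test-retest correlation is as low as it can be.

```python
def pearson_r(a, b):
    """Product-moment correlation between two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cross = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    ss_a = sum((u - mean_a) ** 2 for u in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    return cross / (ss_a * ss_b) ** 0.5


# Hypothetical behavior rates for four teachers on two occasions.
time1 = [2, 4, 6, 8]
time2 = [8, 6, 4, 2]

mean1 = sum(time1) / len(time1)   # mean teacher behavior at Time 1
mean2 = sum(time2) / len(time2)   # mean teacher behavior at Time 2
r = pearson_r(time1, time2)       # test-retest correlation

print(mean1, mean2, r)
```

Here the mean teacher time path is flat (both means equal 5), while the
correlation between occasions equals -1.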

Therefore, that assessments of stability from these three approaches may
contradict each other is not surprising. For example, data on a target behavior
that exhibits stability by virtue of a flat mean teacher time trend (i.e., no
occasion effect in the repeated-measures analysis of variance) may have small
(or even negative) test-retest correlations (see Footnote 12). Moreover, data
on a teacher behavior may show a flat mean teacher time trend and high
test-retest correlation, yet show a small or negligible generalizability
coefficient. Many other contradictions are possible.

Assessments of stability that follow from explicit statements of
consistency over time are to be preferred to the less explicitly formulated
procedures commonly used in research on teaching. Therefore, we recommend that
statistical procedures based on the homogeneity hypotheses (especially estimates
of heterogeneity) be used to address Question A, and that an index of tracking
(in particular, Foulkes-Davis $\gamma$) be used to address Question B.

Footnote 12. This contradiction is exhibited by the data on lower-order (recall
facts) questions for Moon's SCIS teachers. The reanalysis of these data by
Rosenshine (1973, Table 2) produced, for the final four occasions in Moon's
study, a negative average correlation (-.18) and an intraclass correlation of
0.0 (with elements of the between-occasion correlation matrix ranging between
-.49 and .43). Thus, the correlational approach indicates "low stability of
individual teacher behavior across observations" (Rosenshine, 1973, p. 223). In
contrast, a two-way analysis of variance of these data (with teachers and
occasions as the factors) carried out by the authors yields an F-statistic of
1.48 (3 and 43 degrees of freedom) for the occasion effect. Thus by the mean
teacher time path approach, the teacher behavior is found to be stable.

Stability Across Contexts

The stability of teacher behaviors across different contexts (e.g., subject
matter or class composition) has also been of major research interest (e.g.,
Brophy et al., 1973; Shavelson & Dempsey-Atwood, 1976; Evertson, Anderson,
Edgar, Minter, & Brophy, 1977). Research questions and statistical procedures
resemble those for stability of teacher behaviors over time. Specifically, the
two research questions for stability across contexts are

Question A. Is the behavior of an individual teacher consistent
across contexts?

Question B. Are individual differences among teachers consistent
across contexts?

As it has with stability of behavior over time, empirical research has
focused exclusively on consistency of individual differences.

Question A. Statistical procedures for assessing the consistency of an
individual teacher across contexts can be developed to test a null hypothesis of
consistency. For example, with behavior-count data, a null hypothesis of
consistency across two contexts states that the rate of the target behavior for
teacher j is the same in context 1 as in context 2 (i.e.,
$\lambda_{1j} = \lambda_{2j}$). For Bernoulli-trial data the null hypothesis of
consistency across the two contexts for teacher j would be
$\pi_{1j} = \pi_{2j}$.

Often, there may be multiple observations on the target behavior for each 
teacher in each context. For example, in the Texas Junior High School Study 
there are six observations obtained in each of two different class sections (the 
two contexts) for every teacher. For such data, relevant statistical procedures 
for testing the (null) hypothesis of consistency across contexts for each 
teacher can be found in Detre and White (1970) for behavior-count data, and in 
Kleinman (1973) for Bernoulli-trial data. 
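
As a rough illustration of what a test of equal rates across two contexts looks
like for one teacher, the sketch below uses a standard conditional argument: if
the two totals are Poisson with a common rate, then given their sum the first
total is binomial with success probability equal to its share of the observation
time. This is a stand-in chosen for simplicity; it is not the Detre and White
(1970) statistic, and it ignores any occasion-to-occasion heterogeneity within a
context. The function name and the counts are hypothetical.

```python
from math import comb


def equal_rate_p_value(x1, b1, x2, b2):
    """Two-sided exact p-value for H0: rate in context 1 == rate in context 2.

    x1, x2: total counts of the target behavior in each context.
    b1, b2: total observation time in each context (same units).
    Intended for modest counts; very large totals could overflow floats.
    """
    m = x1 + x2
    p0 = b1 / (b1 + b2)  # expected share of events in context 1 under H0
    pmf = [comb(m, k) * p0 ** k * (1 - p0) ** (m - k) for k in range(m + 1)]
    cutoff = pmf[x1] * (1 + 1e-9)  # tolerance for floating-point ties
    # Sum the probabilities of all outcomes no more likely than the observed one.
    return min(1.0, sum(p for p in pmf if p <= cutoff))


# Hypothetical teacher: 30 events in 6 hours of class A, 5 events in 6 hours of class B.
p_unequal = equal_rate_p_value(30, 6.0, 5, 6.0)
# Hypothetical teacher with identical totals in the two classes.
p_equal = equal_rate_p_value(10, 6.0, 10, 6.0)
print(p_unequal, p_equal)
```

A small p-value would lead to rejecting consistency across the two contexts for
that teacher.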

Question B. The consistency of individual differences between two
contexts can be assessed by using a measure of association, such as the product-
moment correlation coefficient. The measure of association summarizes the
degree to which teachers who are high (in relation to the other teachers) on the
target behavior in one context are also high on that target behavior in another
context, and so forth. Perfect consistency of individual differences would be
indicated by a correlation of 1.0. Unlike occasions of observation, different
contexts seldom have an obvious ordering. (One example of ordered contexts
might be low-, middle- and high-ability groups of students.) Hence consistency
across contexts cannot usually be formulated using trends for each teacher.

Example. Data from the Texas Junior High School Study (Evertson & Veldman,
1981) include observations on teachers for two different classes during a school
year (e.g., two different sections of eighth-grade English). These data can be
used to address both Question A and Question B for differences in class
composition. We illustrate the statistical methods for assessing stability
across contexts by analyzing the behavior-count data for two commonly studied
classroom variables: product (lower-order) questions and call-outs by students.
Six months of data on two English classes for each of 25 teachers were used.
For product questions, the null hypothesis that the teacher had equal rates of
behavior in each of the two classes was rejected (at level .05) for 10 of the 25
teachers using Detre and White's (1970) test statistic. However, the
consistency of individual differences among teachers was high, with a
correlation of .84 between rates of product questions for the two classes. A
different picture is seen for call-outs, where the hypothesis of consistency of
individual teachers across contexts was rejected (at level .05) for only 4 of
the 25 teachers. In this case a correlation of .47 across the contexts (see
Footnote 13) does not indicate high consistency of individual differences among
teachers.

These examples show that with stability across contexts, as with stability
over time, it is important to specify an explicit research question and to apply
a statistical procedure that corresponds to that research question. A
correlational analysis of stability for product questions would have indicated
high stability across contexts. Yet two-fifths of the individual teachers were
not consistent across contexts.

Footnote 13. The reported correlations across contexts for both rate of product
questions and call-outs are actually correlations between transformed
quantities, namely $\sqrt{\text{rate} + 3/8}$. The correlations using the raw
(untransformed) rates were .75 and .39, as opposed to .84 and .47,
respectively. Clearly, this transformation of the rate of behavior improves the
linear association across contexts.

Notes on Design 

In the preceding sections, a number of statistical analyses for assessing 
stability have been presented. However, a major aspect of any study of 
stability, the design of the study, has not been explicitly discussed. Design 
considerations include a wide range of investigator decisions about how to carry
out the study. In this section, we comment on three important design 
considerations: observation schedules, observation instruments, and homogeneous 
classroom contexts. 

Observation Schedules 

In designing an observation schedule, the investigator must determine how 
often and for how long the target behavior should be observed. Statistical 
considerations can be useful in determining the number of observation occasions 
and the length of the observation period on each occasion. Of course, normal 
classroom activities limit the possible length of any observation period (e.g., 
mathematics instruction cannot be observed for four hours on each occasion). 
Even so, a variety of observation schedules are possible, some of which will be 
more efficient than others. 

Consistency of an individual over time. The statistical design problem is
to devise an observation schedule that provides (within practical constraints)
as much information as possible about the parameter of interest (in particular,
$\sigma_\lambda^2$ or $\sigma_\pi^2$). For example, in addressing Question A
with behavior-count data, the parameter $\sigma_\lambda^2$ is of interest, both
for testing the homogeneity hypothesis and in estimating heterogeneity. For
testing the homogeneity hypothesis, Bartoo and Puri (1967) demonstrated
(assuming the Poisson model) that it is more efficient to observe for relatively
long periods on a few occasions than to allocate the same total observation time
in shorter sessions over many occasions (e.g., four one-hour observations are
more efficient than eight half-hour observations). A similar conclusion is
indicated for Bernoulli-trial data and the test of $\sigma_\pi^2 = 0$.
Wisniewski (1972) demonstrated (assuming the binomial model) "that a few large
samples are preferable to many small ones for detecting heterogeneity" (p. 680)
(e.g., $T = 10$, $n_i = 10$ is better than $T = 20$, $n_i = 5$).
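
Readers who want to explore this kind of schedule comparison for themselves
could use a small Monte Carlo sketch such as the one below. Everything in it is
an assumption made for illustration: the rates are drawn from a gamma
distribution, the mean rate (10 per hour) and its standard deviation (4) are
invented, the test is the Poisson dispersion statistic of Appendix A written as
the sum of (X_i - b_i * rate)^2 / (b_i * rate), and the .05 critical values
7.815 and 14.067 are the usual chi-square table entries for 3 and 7 degrees of
freedom. It is not a reproduction of the Bartoo and Puri (1967) analysis.

```python
import math
import random


def poisson_draw(mean, rng):
    """Poisson variate by Knuth's multiplication method (fine for small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def dispersion_stat(counts, lengths):
    """Poisson dispersion statistic for counts observed over given lengths."""
    rate = sum(counts) / sum(lengths)
    if rate == 0:
        return 0.0
    return sum((x - b * rate) ** 2 / (b * rate) for x, b in zip(counts, lengths))


def rejection_rate(T, b, critical, mean_rate, sd_rate, reps, rng):
    """Share of simulated teachers for whom homogeneity is rejected."""
    shape = (mean_rate / sd_rate) ** 2      # gamma shape for the occasion rates
    scale = sd_rate ** 2 / mean_rate        # gamma scale for the occasion rates
    hits = 0
    for _ in range(reps):
        rates = [rng.gammavariate(shape, scale) for _ in range(T)]
        counts = [poisson_draw(b * lam, rng) for lam in rates]
        if dispersion_stat(counts, [b] * T) > critical:
            hits += 1
    return hits / reps


rng = random.Random(0)
# Same total observation time (4 hours) split two ways.
power_long = rejection_rate(4, 1.0, 7.815, 10.0, 4.0, 2000, rng)
power_short = rejection_rate(8, 0.5, 14.067, 10.0, 4.0, 2000, rng)
print(power_long, power_short)
```

Comparing the two printed rejection rates for a range of assumed rate
distributions gives an informal check on which schedule detects heterogeneity
more often.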

Consistency of individual differences. Currently available guidance on the
design of observation schedules for addressing Question B is restricted to
results that depend on the assumption of perfect consistency over time for each
individual. The most extensive investigations of the efficiency of different
observation schedules are found in Rowley (1976, 1978), where the effects of
different observation schedules on reliability or generalizability coefficients
are analyzed (see Footnote 14). Rowley bases his analysis on the formulation of
Ebel (1951). Rowley's major finding is "for fixed total observation time,
greater reliability is achieved by the use of a larger number of shorter,
independent observations" (1978, p. 170). This general conclusion is documented
in Figure 2 of Rowley (1978).

Footnote 14. Less extensive studies with a similar orientation are Rosenshine's
(1973) effort "to explore the question of the number of observations necessary
to obtain a trustworthy sample of classroom transactions" (p. 221), Erlich and
Shavelson's (1978) determination that "an unreasonable number of raters and
occasions are required to measure certain variables reliably" (p. 88), and
Erlich and Borich's (1979) analyses "concerning the number of observation
occasions required to reach a 0.7 level of generalizability . . . for the case
in which raters are well trained and not considered to be a significant source
of error" (p. 11). Shavelson and Webb (1981, Section 2.6) advocate the use of
multivariate generalizability analysis to determine the "optimal length of the
observation period while taking into account the correlations among observation
intervals" (p. 154).
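
The "number of occasions required" calculations mentioned above rest on the
Spearman-Brown formula, which gives the reliability of an average of k
occasions from the single-occasion intraclass correlation r as
k r / (1 + (k - 1) r). A minimal sketch follows; the function names are
hypothetical, the 0.7 target simply echoes the Erlich and Borich (1979)
criterion, and the formula assumes the Ebel-type model in which every
individual is consistent over time.

```python
import math


def spearman_brown(r, k):
    """Reliability of the average of k occasions, given single-occasion r."""
    return k * r / (1 + (k - 1) * r)


def occasions_needed(r, target=0.7):
    """Smallest k whose averaged reliability reaches the target (0 < r < 1)."""
    # Solving k*r / (1 + (k-1)*r) >= target for k gives the bound below.
    bound = target * (1 - r) / (r * (1 - target))
    return max(1, math.ceil(bound - 1e-12))


# Hypothetical variable with a single-occasion intraclass correlation of .20.
k = occasions_needed(0.20, target=0.7)
print(k, spearman_brown(0.20, k))
```

With a single-occasion value of .20, ten occasions are needed to reach 0.7,
which is the boundary of the "10 or fewer occasions" criterion quoted earlier.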

Interestingly, Rowley's conclusion about efficient schedules contradicts
the statistical results cited above with regard to the consistency of an
individual's behavior over time (i.e., testing and estimation for
$\sigma_\lambda^2$ and $\sigma_\pi^2$). This contradiction does not diminish the
accuracy of either finding about the construction of efficient observation
schedules. However, the contradiction does reinforce the commonsense notion
that different designs will be optimal or desirable for different questions.
Furthermore, recall that the conclusions of Rowley are based on the model of
Ebel (1951) whose formulation employs the strong assumption that all
individuals are consistent over time in their behavior (i.e., for each
individual the homogeneity hypothesis is assumed to be satisfied). Thus
Rowley's conclusions actually pertain to observation schedules for addressing
Question B conditional on Question A being answered affirmatively for each
individual. Therefore, Rowley's conclusions are not necessarily applicable for
assessing consistency of individual differences using those statistical
procedures based on the configuration of the individual time paths described in
the previous section on indices of tracking.

Observation Instruments 

As part of the design of a study of stability, the investigator must decide
which target behaviors to observe and what information to collect on each target
behavior. For behavior-count data the only information collected is the number
of occurrences of the target behavior. However, this information is only a
fragment of the complete behavioral record; no record is obtained of the
duration of the target behavior and the time elapsed between occurrences of the
target behavior. Recording only the frequency of incidence of the target
behavior, as is done in most classroom observation instruments, precludes many
detailed statistical analyses of teacher behaviors and, perhaps most importantly
for this research, precludes assessment of the validity of the assumptions
(e.g., independence of events, distributional forms) underlying the statistical
methods used to assess stability (see Appendix A). Similarly, for Bernoulli-
trial data the sequence of outcomes of the individual trials contains valuable
information that is lost when only the $X_i$ and $n_i$ are recorded.

Homogeneous Contexts 

A key ingredient in studies of stability is designing the study so that a
focused research question is addressed. Observing teacher behavior in
homogeneous contexts is a basic requirement of a focused research question.
Certainly, studies of stability that collect observations in as constant an
environment as possible (e.g., group mathematics instruction) should precede
studies that deliberately confound temporal and contextual facets (e.g.,
combining observations on both mathematics and English instruction over
occasions). (It would seem unreasonable to expect teachers to be consistent
over such disparate subject-matter contexts.) An unavoidable confounding occurs
in studies of year-to-year temporal stability: the group of children in the
teacher's class changes with the school year.

Conclusion 

Previous empirical studies of stability of teacher behavior have been
limited and unclear. The major weakness in these studies is the lack of an
explicit definition of stability of teacher behavior. Naturally, an attribute
cannot be assessed without first being adequately defined.

Perhaps the most important contribution of this paper has been simply to
formulate basic research questions about stability (i.e., Questions A and B).
By linking these questions about stability to statistical models for various
types of classroom observation data, we identify the statistical hypotheses and
parameters that represent the consistency of teaching behavior over time or
across contexts. To complete the development, statistical methods for the study
of stability that follow from these representations are presented and
illustrated.

A most striking consequence of the confusion and ambiguity in research on
stability of teacher behavior is the absence of research on the consistency of
the behavior of individual teachers. A concern with the consistency of the
behavior of individual teachers is seen as far back as the writing of Barr
(1929, especially p. 29). The methods we present should facilitate empirical
research on the consistency of the behavior of individual teachers and perhaps
serve the broader purpose of stimulating development of methods for addressing
other research questions concerning the activities of individual teachers.

In closing, it is useful to consider what can be gained from this paper's
contributions to the study of stability of teacher behavior. At the least, this
paper ties together and demystifies the empirical and methodological literature
on stability of teacher behavior. At the most, this paper may indicate
important directions for research on teaching through a better understanding of
and better methods for the study of the consistency of teaching behavior. In
1970, Flanders and Simon wrote in the Encyclopedia of Educational Research:

the cutting edge of research on teaching effectiveness during the next
decade may be more concerned with variation of teaching behavior between
visits and with the consequences of this variation compared with the thrust
of research that existed [in 1962] when the Gage Handbook went to press.
(Flanders & Simon, 1970, p. 1423)

For better or worse, their prediction has not been realized. We cannot judge
how important the study of stability of teacher behavior will ultimately be for
teaching effectiveness research. Yet, regardless of whether or not research on
the "variation" of teaching behavior is to be prominent in research on teaching,
it's a good idea to get it right.

Appendix A

Statistical Procedures 
In this Appendix we present details of the statistical procedures for
addressing Question A with behavior-count or Bernoulli-trial data. For each
type of data, we present statistical procedures for testing the homogeneity 
hypothesis and for estimating the amount of heterogeneity. In addition, the 
statistical models and the assumptions underlying these procedures are 
discussed. Numerous references to statistical literature on discrete 
distributions and point processes are provided to guide the reader to further 
treatments of relevant technical issues. 

Homogeneity Hypothesis: Behavior-Count Data

Poisson Model

The statistical procedures for behavior-count data are based on the natural
assumption of a Poisson distribution for the counts. That is, on occasion i,
the probability of $X_i$ events occurring in an interval $b_i$ is

$$\Pr(X_i) = \frac{e^{-b_i \lambda_i} (b_i \lambda_i)^{X_i}}{X_i!}.$$

For any single occasion the natural estimate of $\lambda_i$ is the empirical
rate of observed behavior at the occasion, $\hat{\lambda}_i = X_i / b_i$; the
estimate of $\lambda$ is the weighted average
$\hat{\lambda} = \sum_i X_i / \sum_i b_i$ (the maximum likelihood estimate under
the homogeneity hypothesis).

For each teacher, each occasion of observation is assumed to provide an
independent sample of behavior; that is, the $X_i$ are assumed to be independent
across occasions. Under the homogeneity hypothesis ($\lambda_i = \lambda$ for
all i) the distribution of each $X_i$ is Poisson with mean $b_i \lambda$, and
the resulting model is the distribution function given above with the common
$\lambda$ replacing $\lambda_i$ (see Potthoff and Whittinghill, 1966b,
Equation 1).

Within an occasion the assumption of a Poisson distribution for the $X_i$ is
satisfied if the individual events are generated by a Poisson process. The
Poisson process "plays a role in point process theory in most respects analogous
to that of the normal distribution in the study of random variables" (Cox &
Isham, 1980, p. 45).

Assessments of the validity of assumptions about the distribution of the
$X_i$, or about the point process assumed to generate the $X_i$, require data on
the individual events (such as waiting times between events). Cox and Lewis
(1966, chapter 6) present a number of methods for testing general renewal
process models and specifically, in section 6.3, tests for Poisson processes are
presented. In particular, the validity of the assumption of a Poisson
distribution within each occasion cannot be evaluated from just the $X_i$ and
$b_i$. In serious empirical work, assessments of the validity of the statistical
model should be made. Marked deviations from the assumption of a Poisson
distribution, such as those that may be introduced by severe dependence among
individual events (see Cox and Lewis, 1966, chapters 2 and 7 for definitions of
independence and non-independence in series of events), may render tests of the
homogeneity hypothesis equivocal because positive dependence within occasions
may not be distinguishable from heterogeneity across occasions.

Test Statistics

To test the null hypothesis $\lambda_i = \lambda$ for all i against the general
alternative that not all the $\lambda_i$ are equal we use the statistic

$$\sum_{i=1}^{T} \frac{(X_i - b_i \hat{\lambda})^2}{b_i \hat{\lambda}} \qquad \text{(A1)}$$

which is distributed approximately as $\chi^2$ with $T - 1$ degrees of freedom
under the null hypothesis (see Potthoff and Whittinghill, 1966b). Thus the
homogeneity hypothesis is rejected, at level $\alpha$, when the test statistic
exceeds the critical value $\chi^2_{T-1}(\alpha)$. The test statistic in
Expression A1 assesses whether the $\hat{\lambda}_i$ are more spread out (over
occasions) than would be expected under the homogeneity hypothesis (given the
Poisson model).

The structure of the test statistic may be understood more clearly from the
alternative expression:

$$\frac{\sum_{i=1}^{T} b_i (\hat{\lambda}_i - \hat{\lambda})^2}{\hat{\lambda}} \qquad \text{(A2)}$$

Expression (A2) shows that the numerator of the test statistic is a weighted
variance of the $\hat{\lambda}_i$. The $b_i$, which are known to the analyst,
are often fixed in advance by the observation schedule. Alternatively, the
$b_i$ may be determined by the immediate classroom situation (e.g., in the
observation of teaching behavior during reading instruction in an
elementary-school classroom, the length of observation depends on how long the
teacher carries out reading instruction). See also Potthoff and Whittinghill
(1966b, p. 183).

When all the $b_i$ equal one, the test statistic in Expressions A1 and A2
reduces to the familiar "Poisson Index of Dispersion" (also known as the
"variance test") introduced by R. A. Fisher (see Fisher, Thornton & Mackenzie,
1922; Fisher, 1950; Hoel, 1943). This statistic has the simple form

$$\frac{\sum_{i} (X_i - \bar{X})^2}{\bar{X}} \qquad \text{(A3)}$$

In Expression A3 the numerator is the sample variance (over occasions)
multiplied by $T - 1$, and the denominator $\bar{X}$ is an estimate of the
variance (within occasions) under the homogeneity hypothesis.
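
The computation of this test statistic can be sketched as follows. The sketch
assumes the statistic has the form sum of (X_i - b_i * rate)^2 / (b_i * rate),
with the pooled rate equal to total count over total observation time, and it
checks numerically that the weighted-variance form and the unit-length
(index of dispersion) form agree with it. The function names and the counts are
hypothetical and chosen only for illustration.

```python
def poisson_dispersion(counts, lengths):
    """Return (statistic, degrees of freedom) for the homogeneity test.

    counts[i]  : number of events observed on occasion i (X_i).
    lengths[i] : length of the observation period on occasion i (b_i).
    Assumes at least one event was observed overall.
    """
    rate = sum(counts) / sum(lengths)  # pooled rate under homogeneity
    stat = sum((x - b * rate) ** 2 / (b * rate) for x, b in zip(counts, lengths))
    return stat, len(counts) - 1


def weighted_variance_form(counts, lengths):
    """Same statistic written as a weighted variance of the occasion rates."""
    rate = sum(counts) / sum(lengths)
    return sum(b * (x / b - rate) ** 2 for x, b in zip(counts, lengths)) / rate


# Hypothetical teacher observed on four occasions of unequal length (hours).
counts = [12, 7, 15, 9]
lengths = [1.0, 0.5, 1.5, 1.0]
stat, df = poisson_dispersion(counts, lengths)
alt = weighted_variance_form(counts, lengths)

# With unit lengths the statistic is sum((x - mean)^2) / mean.
unit_stat, unit_df = poisson_dispersion([3, 5, 4, 8], [1, 1, 1, 1])
print(stat, alt, df, unit_stat)
```

The statistic would be compared with the upper-tail chi-square critical value
for the returned degrees of freedom (for example, 7.815 at the .05 level with 3
degrees of freedom).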

Potthoff and Whittinghill (1966b) showed that Expression A3 is the locally
most powerful unbiased test against the negative binomial alternative (i.e., the
$\lambda_i$ follow a gamma distribution). Extensive study has shown the tests
based on Expressions A3 and A1 to have reasonably good power against a variety
of alternatives (see Bennett and Birch, 1974; Darwin, 1957; Gbur, 1981; Paul and
Plackett, 1978).

Alternative Statistics. A likelihood ratio statistic for testing this
homogeneity hypothesis has been presented by Cox and Lewis (1966, Section 9.3,
Equation 8). This statistic is asymptotically equivalent to Expression A1 and
yields nearly identical results to Expression A1 in small samples. Other
statistics for testing the homogeneity hypothesis are designed to be sensitive
to the alternative that the $\lambda_i$ follow a gamma distribution as opposed
to the null hypothesis that $\lambda_i = \lambda$. (The use of the gamma
distribution for the $\lambda_i$ is for mathematical convenience because it
yields a negative binomial distribution for the $X_i$.) Test statistics designed
to be optimal for the gamma alternative are examined in Potthoff and
Whittinghill (1966b) and Buhler, et al. (1965). For applications, the test
statistic in Expression A1 should be used, unless strong reasons exist for
positing a gamma distribution for the $\lambda_i$.

Estimating Heterogeneity: Behavior-Count Data

The variance of the distribution of the $\lambda_i$, $\sigma_\lambda^2$,
represents the heterogeneity over occasions of the rate of the target behavior.
In this paper we use estimates of $\sigma_\lambda^2$ to describe the behavior of
individual teachers. The statistical model for the estimation of heterogeneity
is the same as that used for testing the homogeneity hypothesis; however, the
estimates of heterogeneity are much less vulnerable to moderate violations of
the statistical model and its assumptions.

Variance Component Estimates. Cox (1955, Section 5.3) developed useful
estimates of $\sigma_\lambda^2$, the simplest of which is:

$$\frac{(d - 1)(T - 1)\hat{\lambda}}{\sum_i b_i - \sum_i b_i^2 / \sum_i b_i} \qquad \text{(A4)}$$

where d is the test statistic in Expression A1 divided by $T - 1$. The
estimate for $\sigma_\lambda^2$ used in the analyses presented in this paper is
an adaptive estimator related to Expression A4 (for details see Cox, 1955,
Section 5.3). Very small values of the test statistic in Expression A1 may
result in corresponding estimates of $\sigma_\lambda^2$ that are negative. In
such cases the estimate of $\sigma_\lambda^2$ is set to zero, as is consistent
with a failure to reject the homogeneity hypothesis.
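
A minimal sketch of the simple estimate in Expression A4, including the
truncation at zero, is given below. It implements only the simple form (d - 1)
(T - 1) * rate / (sum of b_i minus sum of b_i squared over sum of b_i), not the
adaptive estimator used for the analyses in this paper, and it assumes the same
form of the dispersion statistic as above. The function name and the counts are
hypothetical.

```python
def heterogeneity_estimate(counts, lengths):
    """Simple variance-component estimate of rate heterogeneity, floored at 0.

    counts[i]  : number of events on occasion i (X_i).
    lengths[i] : length of the observation period on occasion i (b_i).
    Assumes at least two occasions and at least one observed event.
    """
    T = len(counts)
    total_length = sum(lengths)
    rate = sum(counts) / total_length
    stat = sum((x - b * rate) ** 2 / (b * rate) for x, b in zip(counts, lengths))
    d = stat / (T - 1)  # dispersion statistic per degree of freedom
    denom = total_length - sum(b * b for b in lengths) / total_length
    estimate = (d - 1) * (T - 1) * rate / denom
    return max(0.0, estimate)  # negative estimates are set to zero


# Hypothetical teacher whose counts vary widely across four one-hour occasions.
varied = heterogeneity_estimate([1, 12, 3, 20], [1, 1, 1, 1])
# Hypothetical teacher with identical counts on every occasion.
steady = heterogeneity_estimate([5, 5, 5, 5], [1, 1, 1, 1])
print(varied, steady)
```

With unit observation lengths the estimate reduces to the sample variance of
the counts (divisor T - 1) minus their mean, which gives a quick hand check.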

Negative Binomial Estimates. When all the $b_i = 1$ and an assumption of a
gamma distribution for the $\lambda_i$ can be made, estimation of the variance
of the gamma distribution is based on estimation for the resulting negative
binomial distribution of the $X_i$. Hence, a method of moments estimate for
$\sigma_\lambda^2$ (termed the "Evans-Anscombe" estimate by Shenton and Myers,
1963) is simply

$$\frac{\sum_i (X_i - \bar{X})^2}{T} - \bar{X}. \qquad \text{(A5)}$$

The maximum likelihood estimate for $\sigma_\lambda^2$ under these assumptions
was developed by Fisher (1950, 1953; see also Bliss, 1953). Although iterative
methods are required, the computation of the maximum likelihood estimate is
straightforward (see Johnson and Kotz, 1969, Section 3.8).

Homogeneity Hypothesis: Bernoulli-Trial Data

Binomial Model

The statistical procedures for Bernoulli-trial data are based on the
assumption that the sum of the outcomes of the Bernoulli trials on any single
occasion follows a binomial distribution. That is, for occasion i the
probability that $X_i$ successes occur in the $n_i$ trials is

$$\Pr(X_i) = \binom{n_i}{X_i} \pi_i^{X_i} (1 - \pi_i)^{n_i - X_i}.$$

For any single occasion the natural estimate of $\pi_i$ is the empirical
proportion of successes, $p_i = X_i / n_i$, and the estimate of $\pi$ is the
weighted average $p = \sum_i X_i / \sum_i n_i$ (the maximum likelihood estimate
under the homogeneity hypothesis).

For each teacher, each occasion of observation is assumed to provide an
independent sample of behavior, yielding, for each teacher, T independent
samples of sizes $n_1, n_2, \ldots, n_T$. Under the homogeneity hypothesis
($\pi_i = \pi$ for all i) the distribution of each $X_i$ is binomial with the
same parameter $\pi$, and the resulting model is the distribution function given
above with the common $\pi$ replacing $\pi_i$ (see Potthoff and Whittinghill,
1966a, Equation 1).

Within an occasion, the assumption of a binomial distribution for the sum
of the Bernoulli-trials is satisfied if the Bernoulli-trials are identically and
independently distributed (i.i.d.). Assessments of the validity of the
assumptions about the structure of individual events require data on the
individual events. That is, the validity of the assumption of a binomial
distribution within each occasion cannot be evaluated from just the $X_i$ and $n_i$.
Various models for dependence among the Bernoulli-trials are studied by Altham
(1978) and Klotz (1973); Klotz (1973, Section 6) and Tarone (1979, Section 2)
developed estimation and testing procedures for detecting dependence of the
Bernoulli-trials within occasions. Violations of the assumption of i.i.d.
Bernoulli-trials are important for the statistical procedures we use only
insofar as the assumption of a binomial distribution for the $X_i$ is undermined
(especially with regard to the variance of $X_i$ being $n_i \pi_i (1 - \pi_i)$). Minor
violations of the assumption of i.i.d. Bernoulli-trials will not greatly affect
even the statistical tests of the homogeneity hypothesis. However, in serious
empirical work, assessments of the validity of the assumption of i.i.d. Bernoulli
trials should be made. Marked deviations from the assumption of a binomial
distribution, such as those that may be introduced by severe dependence amongst
the Bernoulli-trials, may render tests of the homogeneity hypothesis equivocal,
as positive dependence within occasions may not be distinguishable from
heterogeneity across occasions.
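
A simplified case shows why positive dependence can mimic heterogeneity. Suppose, purely as an illustrative assumption and not as one of the dependence models cited above, that within occasion $i$ every pair of Bernoulli-trials $Y_{ij}$, $Y_{ik}$ has the same correlation $\rho$. Then

$$\operatorname{Var}(X_i) = \sum_{j} \operatorname{Var}(Y_{ij}) + \sum_{j \neq k} \operatorname{Cov}(Y_{ij}, Y_{ik}) = n_i \pi_i (1 - \pi_i)\,\{1 + (n_i - 1)\rho\}.$$

With $\rho > 0$ the variance of $X_i$ exceeds the binomial value $n_i \pi_i (1 - \pi_i)$, so the observed proportions scatter more widely over occasions than the binomial model predicts even when all the $\pi_i$ are equal.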

Test Statistics 

To test the null hypothesis that $\pi_i = \pi$ for all $i$ against the general
alternative that not all the $\pi_i$ are equal we use the "Binomial index of
dispersion"

$$\sum_{i=1}^{T} \frac{(X_i - n_i p)^2}{n_i\, p\,(1 - p)} \tag{A6}$$

which is distributed approximately (for $n_i$ not small) as $\chi^2$ with $T - 1$ degrees
of freedom under the null hypothesis (see Hoel, 1943; Potthoff & Whittinghill,
1966a; Wisniewski, 1968, 1972). (A familiar use for this statistic is in
testing for "independence" in a 2 x T contingency table; see, for example,
Snedecor and Cochran, 1980, Section 11.7.) Thus the homogeneity hypothesis is
rejected, at level $\alpha$, when this test statistic exceeds the critical value
$\chi^2_{T-1}(\alpha)$. This test statistic assesses whether the $p_i$ are more spread out
(over occasions) than would be expected under the homogeneity hypothesis (given
the binomial model). An interesting historical note is that the test statistic
in Expression A6 divided by its degrees of freedom is the Lexis quotient, which
was prominent in the late nineteenth and early twentieth centuries in the study
of consistency or stability of statistical series (see Bortkiewicz, 1931;
Forsyth, 1932, 1937; Heyde and Seneta, 1977; Lexis, 1877).
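
The test can be sketched in a few lines of Python. The function names, the example counts, and the choice to pass the chi-square critical value in as an argument are assumptions made for this illustration; 7.815 is the tabled upper 5% point of chi-square with 3 degrees of freedom.

```python
def binomial_dispersion_index(successes, trials):
    """Expression A6: sum over occasions of (X_i - n_i p)^2 / (n_i p (1 - p))."""
    p = sum(successes) / sum(trials)  # pooled proportion
    if p <= 0.0 or p >= 1.0:
        raise ValueError("statistic is undefined when the pooled proportion is 0 or 1")
    return sum((x - n * p) ** 2 / (n * p * (1 - p))
               for x, n in zip(successes, trials))


def homogeneity_test(successes, trials, critical_value):
    """Compare Expression A6 with a caller-supplied chi-square critical value."""
    statistic = binomial_dispersion_index(successes, trials)
    df = len(trials) - 1
    return {
        "statistic": statistic,
        "df": df,
        "lexis_quotient": statistic / df,  # A6 divided by its degrees of freedom
        "reject": statistic > critical_value,
    }


# Hypothetical counts on T = 4 occasions (illustration only).
n = [40, 50, 30, 60]
stable = homogeneity_test([12, 18, 9, 21], n, critical_value=7.815)
unstable = homogeneity_test([5, 30, 10, 45], n, critical_value=7.815)
```

In the first hypothetical series the proportions vary less than the binomial model allows for and the hypothesis is retained; in the second the statistic far exceeds the critical value.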

For the special case of $n_i = n$, Potthoff and Whittinghill (1966a) and Gart
(1970) demonstrated that the statistic in Expression A6 is optimal, in the sense
of locally most powerful unbiased, against the beta-binomial alternative. Power
functions for the test based on Expression A6 for a variety of alternative
distributions have been studied by Bennett and Kaneshiro (1978) and Wisniewski
(1972).

Alternative Statistics. Another statistic for testing the homogeneity
hypothesis is obtained from the likelihood ratio criterion (Bennett & Kaneshiro,
1978, Equation 4). This test statistic is asymptotically equivalent to
Expression A6, and the numerical results of Bennett and Kaneshiro (1978) show
that small sample properties favor the use of Expression A6. Other statistics
for testing the homogeneity hypothesis are designed to be sensitive to the
alternative that the $\pi_i$ follow a beta distribution as opposed to the null
hypothesis that $\pi_i = \pi$. The use of the beta distribution is for mathematical
convenience as it yields a beta-binomial distribution for the $X_i$. An
asymptotically optimal test statistic for the beta-binomial alternative is
derived in Section 3 of Tarone (1979); see also Gart (1970), Potthoff and
Whittinghill (1966a) and Wisniewski (1968). For applications, Expression A6
should be used unless strong reasons exist for positing the beta distribution
for the $\pi_i$.

Estimating Heterogeneity: Bernoulli-Trial Data

The variance of the distribution of the $\pi_i$, $\sigma^2_\pi$, represents the
heterogeneity over occasions for a behavior having the form of a Bernoulli
trial. In this paper we use estimates of $\sigma^2_\pi$ to describe the behavior of
individual teachers. The estimation of heterogeneity relies on the binomial
model and its assumptions; however, the estimates of heterogeneity are much less
vulnerable than the test of the homogeneity hypothesis to moderate violations of
the model.

Three estimates of $\sigma^2_\pi$ have been developed in the statistical literature.
The simplest estimate, developed by Hendricks (1935), is

$$\Bigl\{ \sum_i (p_i - p)^2 - p(1 - p) \sum_i (1/n_i) \Bigr\} \Big/ T \tag{A7}$$

and can be derived by inverting the Lexis Formula. Another estimate of $\sigma^2_\pi$,
obtained by Robertson (1951), is

$$\frac{\{d - (T - 1)\}\, p(1 - p)}{\sum_i n_i - \bigl(\sum_i n_i^2 / \sum_i n_i\bigr) - (T - 1)} \tag{A8}$$

where $d$ is the statistic in Expression A6. This estimate
can be derived by solving for $\sigma^2_\pi$ in the expectation of the test statistic in
Expression A6. When the value of the test statistic in Expression A6 is very
small, either or both of Expressions A7 and A8 may produce negative estimates of
$\sigma^2_\pi$. In such cases the estimate of $\sigma^2_\pi$ should be set to zero, as is consistent
with a failure to reject the homogeneity hypothesis.
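
The two closed-form estimates can be sketched as follows. This is an illustration under stated assumptions: the function names and counts are hypothetical, and d in Expression A8 is taken to be the Expression A6 statistic itself, which is the reading under which A8 follows from solving the expectation of A6 for the variance. Negative values are set to zero, as described above.

```python
def hendricks_estimate(successes, trials):
    """Expression A7, truncated at zero."""
    T = len(trials)
    p = sum(successes) / sum(trials)
    p_i = [x / n for x, n in zip(successes, trials)]
    spread = sum((pi - p) ** 2 for pi in p_i)
    binomial_part = p * (1 - p) * sum(1 / n for n in trials)
    return max((spread - binomial_part) / T, 0.0)


def robertson_estimate(successes, trials):
    """Expression A8, truncated at zero (assumes d is the A6 statistic)."""
    T = len(trials)
    total_n = sum(trials)
    p = sum(successes) / total_n
    d = sum((x - n * p) ** 2 / (n * p * (1 - p))
            for x, n in zip(successes, trials))
    denominator = total_n - sum(n * n for n in trials) / total_n - (T - 1)
    return max((d - (T - 1)) * p * (1 - p) / denominator, 0.0)


# Hypothetical counts on T = 4 occasions (illustration only).
n = [40, 50, 30, 60]
h_unstable = hendricks_estimate([5, 30, 10, 45], n)
r_unstable = robertson_estimate([5, 30, 10, 45], n)
h_stable = hendricks_estimate([12, 18, 9, 21], n)
r_stable = robertson_estimate([12, 18, 9, 21], n)
```

For the second hypothetical series, whose proportions vary less than the binomial model predicts, both raw estimates are negative and are therefore reported as zero.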

Kleinman (1973) developed an iterative estimate of $\sigma^2_\pi$ based on method of
moments estimation for the beta-binomial distribution (defined by Equations 2.5
through 2.8 in Kleinman, 1973). In practice, the simple estimate of Hendricks
(Expression A7) agrees closely with Kleinman's estimate, whereas Robertson's
estimate (Expression A8) is consistently larger than the other two. In the
examples in the text, Kleinman's estimate is used. Kleinman's estimate is
constrained to produce only non-negative estimates of $\sigma^2_\pi$.

Texas Junior High School Study

During the 1974-1975 school year, a large-scale process-product study at 
the junior-high level was conducted by the Research and Development Center for 
Teacher Education (Evertson & Veldman, 1981; Evertson et al., 1977). Data 
collection began in November 1974 and continued until mid-May 1975. Each of 68 
seventh- and eighth-grade teachers (29 math and 39 English) was observed in two
of their classes. In each of the 136 classes about 20 one-hour observations 
were conducted, yielding data on a variety of high-inference (global rating) and 
low-inference (frequencies of behavior) measures. 

In our use of these data the original 20 distinct occasions of observation 
have been combined by pooling observations within a calendar month into 6 
different occasions (November through April). There is nothing superior or 
necessarily desirable about this grouping of the one-hour observations; our 
analyses use the grouped data only because this grouping was also present in the
analyses of the Texas Junior High School data by Evertson and Veldman (1981), 
from whom we obtained these data. (Also, the data we obtained were in the form 
of monthly rates of behavior; we determined the and b^ for the monthly data 
from the fractional rates.) Data from the Texas Junior High School Study are 
used in Tables 1 and 5 (the low-inference variable, Behavioral Criticism),
Figure 3 and Table 6 (the low-inference variable, Product Questions), Table 4
(the high-inference variable, Positive Affect), and in the section Stability
Across Contexts (the low-inference variables, Product Questions and Call-outs).
In Tables 1, 4, and 5 we retain the original teacher-class identification codes 
from the Texas Junior High School Study. For example, the identification number 
21052 in Table 1 denotes observations of English teacher number 05 in school 1 
for the class scheduled during the second period of the school day. 
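
The pooling of individual observations into monthly occasions can be sketched as below. The record layout (date, successes, trials), the function name, and all values are hypothetical; the sketch assumes Bernoulli-trial counts that can simply be summed within a calendar month, which is not necessarily how the monthly rates in the original data set were formed.

```python
from collections import defaultdict
from datetime import date


def pool_by_month(observations):
    """Pool (date, successes, trials) records into one occasion per calendar month."""
    totals = defaultdict(lambda: [0, 0])
    for day, successes, trials in observations:
        key = (day.year, day.month)
        totals[key][0] += successes
        totals[key][1] += trials
    # Return the occasions in chronological order.
    return [(key, x, n) for key, (x, n) in sorted(totals.items())]


# Hypothetical one-hour observations for a single class (illustration only).
records = [
    (date(1974, 11, 5), 3, 20),
    (date(1974, 11, 19), 5, 25),
    (date(1974, 12, 3), 4, 30),
    (date(1975, 1, 14), 6, 22),
    (date(1975, 1, 28), 2, 18),
]
monthly = pool_by_month(records)
```

Each pooled occasion then supplies one pair of counts for the monthly analysis, so months with more observations contribute larger numbers of trials.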

Moon's Elementary-School Science Study

Moon (1969, 1971) studied the verbal behavior patterns of 32 mid-Michigan 
elementary-school teachers. Sixteen of these teachers used a new science 
curriculum (SCIS), while the other 16 "conventional" teachers served as a 
control group. Each of the 16 teachers of the SCIS curriculum was observed 
twice in the spring of 1968, before the summer workshop on the new curriculum, 
and on four occasions during the following school year. The control teachers 
were observed only twice. 

Audio recordings of the science lessons at each of the six observation 
occasions were coded using both Flanders' interaction-analysis system and an 
independently developed instrument for counting and classifying teacher 
questions. Moon's data on teacher questions have been used to compute 
"stability" coefficients by Rosenshine (1973), who used the last four 
observations on the SCIS teachers, and by Shavelson and Dempsey-Atwood (1976), 
who used the observational data on both the SCIS and conventional teachers. 
Shavelson and Dempsey-Atwood also reported "stability coefficients" for the 
indirectness-directness (I/D) ratio obtained from Flanders' instrument.

From Moon's original coding sheets we were able to recover counts of each 
of the five teacher-question types for each of the six observation occasions. 
In Tables 2, 7, 8, 9, and 10 and in Footnote 12, the data on teachers' use of 
recall-facts questions (lower-order questions) on the last four observation 
occasions (those following the summer workshop) were used. Also, Table 3 
presents five data points for Flanders' indirectness-directness (I/D) ratio. 
The first datum is the average of the indirectness-directness (I/D) ratios for
the two pre-workshop observations; the remaining data points represent 
observations 3 through 6. 

Trinchero's Student Teacher Study

Trinchero (1974) used videotapes of English and social-studies student 
teachers for his dissertation study. The tapes were made as part of the 
Stanford Teacher Education Program during the 1967-1968 school year. Each 
student teacher taught pre-set lessons to a group of 9th and 10th graders, who
were paid volunteers. In 1972 Trinchero had observers code these tapes, 
recording counts for different categories of teacher questions for three 
occasions of observation. Using data for two occasions, Shavelson and Dempsey- 
Atwood (1976, Table 1) report "stability coefficients" for a variety of these 
teacher questioning behaviors. The data on teacher questions provided the 
Bernoulli-trial data for lower-order questions used in Table 10. 